[ 
https://issues.apache.org/jira/browse/FLINK-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Sellero updated FLINK-13292:
--------------------------------------
    Description: 
When I try to read an ORC file using flink-orc, a NullPointerException is thrown.
I suspect this issue is related to the closed issue 
https://issues.apache.org/jira/browse/FLINK-8230

It happens when trying to read the string fields of a nested struct. This is 
my schema:
{code:java}
      "struct<" +
        "operation:int," +
        "originalTransaction:bigint," +
        "bucket:int," +
        "rowId:bigint," +
        "currentTransaction:bigint," +
        "row:struct<" +
        "id:int," +
        "headline:string," +
        "user_id:int," +
        "company_id:int," +
        "created_at:timestamp," +
        "updated_at:timestamp," +
        "link:string," +
        "is_html:tinyint," +
        "source:string," +
        "company_feed_id:int," +
        "editable:tinyint," +
        "body_clean:string," +
        "activitystream_activity_id:bigint," +
        "uniqueness_checksum:string," +
        "rating:string," +
        "review_id:int," +
        "soft_deleted:tinyint," +
        "type:string," +
        "metadata:string," +
        "url:string," +
        "imagecache_uuid:string," +
        "video_id:int" +
        ">>",{code}
{code:java}
[error] Caused by: java.lang.NullPointerException
[error]         at java.lang.String.checkBounds(String.java:384)
[error]         at java.lang.String.<init>(String.java:462)
[error]         at org.apache.flink.orc.OrcBatchReader.readString(OrcBatchReader.java:1216)
[error]         at org.apache.flink.orc.OrcBatchReader.readNonNullBytesColumnAsString(OrcBatchReader.java:328)
[error]         at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:215)
[error]         at org.apache.flink.orc.OrcBatchReader.readNonNullStructColumn(OrcBatchReader.java:453)
[error]         at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:250)
[error]         at org.apache.flink.orc.OrcBatchReader.fillRows(OrcBatchReader.java:143)
[error]         at org.apache.flink.orc.OrcRowInputFormat.ensureBatch(OrcRowInputFormat.java:333)
[error]         at org.apache.flink.orc.OrcRowInputFormat.reachedEnd(OrcRowInputFormat.java:313)
[error]         at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:190)
[error]         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
[error]         at java.lang.Thread.run(Thread.java:748){code}
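Reading the stack trace, the NPE surfaces in String.checkBounds, which dereferences the byte[] handed to the String constructor, so readString is presumably passing it a null byte array taken from the column vector. Below is a minimal sketch of that mechanism; it is an assumption about the root cause, not a confirmed diagnosis, and the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class NpeSketch {
    // Simulates what the trace suggests: OrcBatchReader.readString builds a
    // String from a column-vector entry, and that entry is null even though
    // the column is read on the non-null code path.
    static boolean nullEntryTriggersNpe() {
        byte[] vectorEntry = null; // stands in for bytes.vector[i] being null
        try {
            new String(vectorEntry, 0, 0, StandardCharsets.UTF_8);
            return false;
        } catch (NullPointerException e) {
            return true; // checkBounds dereferences the null array
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE triggered: " + nullEntryTriggersNpe());
    }
}
```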
Instead of using the Table API, I am reading the ORC files in batch mode 
as follows:
{code:java}
env
  .readFile(
    new OrcRowInputFormat(
      "",
      "SCHEMA_GIVEN_BEFORE",
      new HadoopConfiguration()
    ),
    "PATH_TO_FOLDER"
  )
  .writeAsText("file:///tmp/test/fromOrc")
{code}


Thanks for your support



> NullPointerException when reading a string field in a nested struct from an 
> Orc file.
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-13292
>                 URL: https://issues.apache.org/jira/browse/FLINK-13292
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / ORC
>    Affects Versions: 1.8.0
>            Reporter: Alejandro Sellero
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
