[ https://issues.apache.org/jira/browse/ORC-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola updated ORC-633:
-----------------------
    Description: 
I am reading a directory of ORC files using Flink. However, some of the files 
are broken.

I get exceptions like this:
{code:java}
org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530){code}
 

I have also enabled the "skip corrupt data" option in my configuration:
{code:java}
conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);{code}
 

but it only handles a specific case and does not skip broken files.

Is it possible to not throw an exception for any kind of broken ORC file and 
return only the valid ones?

  was:
I am reading a directory of ORC files using Flink. However, some of the files 
are broken.

I get exceptions like this:
{code:java}
org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530){code}
 

I have also enabled the "skip corrupt data" option in my configuration:
{code:java}
conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);{code}
 

but it only handles a specific case and does not skip broken files.

Is it possible to skip all broken ORC files, whatever the reason, and take 
only the valid ones?


> Skip broken ORC files when reading
> ----------------------------------
>
>                 Key: ORC-633
>                 URL: https://issues.apache.org/jira/browse/ORC-633
>             Project: ORC
>          Issue Type: Improvement
>          Components: Reader
>    Affects Versions: 1.6.3
>            Reporter: Nikola
>            Priority: Critical
>
> I am reading a directory of ORC files using Flink. However, some of the files 
> are broken.
> I get exceptions like this:
> {code:java}
> org.apache.orc.FileFormatException: Not a valid ORC file /user/orc/0.orc (maxFileLength= 9223372036854775807)
> at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:546)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
> at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
> at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
> at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530){code}
>  
> I have also enabled the "skip corrupt data" option in my configuration:
> {code:java}
> conf.setBoolean(OrcConf.SKIP_CORRUPT_DATA.getAttribute(), true);{code}
>  
> but it only handles a specific case and does not skip broken files.
> Is it possible to not throw an exception for any kind of broken ORC file and 
> only return the valid ones?
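
Until the reader offers such an option, one possible workaround is to probe each file before handing the list to the job and drop the ones that fail. The following is only a rough sketch under stated assumptions, not something ORC or Flink provides: the class OrcFileFilter, the Probe interface and the probeMagic method are hypothetical names, and the files are assumed to be on the local file system for illustration. The stand-in probe checks only the 3-byte "ORC" magic at the start of the file, so it does not catch a damaged file tail. A real probe would more likely call OrcFile.createReader, the call that throws in the stack trace above, and rely on FileFormatException being an IOException.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ORC or Flink.
class OrcFileFilter {

    /** Hypothetical hook: throws IOException if the file cannot be opened as ORC. */
    @FunctionalInterface
    interface Probe {
        void check(Path file) throws IOException;
    }

    /** Returns the files the probe accepts; files that make it throw are skipped. */
    static List<Path> filterReadable(List<Path> files, Probe probe) {
        List<Path> readable = new ArrayList<>();
        for (Path file : files) {
            try {
                probe.check(file);
                readable.add(file);
            } catch (IOException e) {
                // Log and move on instead of failing the whole read.
                System.err.println("Skipping " + file + ": " + e);
            }
        }
        return readable;
    }

    /** Stand-in probe: verifies only the "ORC" magic bytes at the start of the file. */
    static void probeMagic(Path file) throws IOException {
        byte[] magic = "ORC".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readFully(head); // throws EOFException if the file is shorter than 3 bytes
        }
        if (!Arrays.equals(head, magic)) {
            throw new IOException("missing ORC magic header");
        }
    }
}
```

With this sketch, OrcFileFilter.filterReadable(files, OrcFileFilter::probeMagic) would return only the files that pass the probe, and that filtered list could then be given to the input format. Opening every file twice has a cost, which is part of why a reader-side option would still be preferable.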



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
