[
https://issues.apache.org/jira/browse/ORC-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017390#comment-17017390
]
Owen O'Malley commented on ORC-508:
-----------------------------------
The problem is that the main ORC API uses a few of the Hadoop classes. The
challenge is how to accomplish your goals without breaking backwards
compatibility for the current users.
For example, to open an ORC file, the user writes:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

// Both Configuration and Path are Hadoop classes.
Configuration conf = new Configuration();
Reader reader = OrcFile.createReader(new Path("my-file.orc"),
                                     OrcFile.readerOptions(conf));
{code}
If you remove the Hadoop classes Path and Configuration, the main APIs break.
That would cause a lot of pain for users. So I was proposing an alternative
"orc-nano-hadoop" module that contains a handful of classes that are drop-in
replacements for what the ORC API needs from Hadoop. Users would then include
either the hadoop jars or the orc-nano-hadoop jar, depending on whether they
want to read ORC files in a Hadoop ecosystem.
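To make the drop-in-replacement idea concrete, here is a minimal sketch of what such a module's classes might look like. Everything here is an illustrative assumption: the `NanoHadoopSketch` wrapper and the class bodies are hypothetical, and only the `Path`/`Configuration` method shapes mirror what the ORC snippet above actually touches; they do not reflect any actual orc-nano-hadoop design.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "orc-nano-hadoop" idea: minimal stand-ins for
// the two Hadoop classes the ORC reader API exposes. The shapes below are
// assumptions for illustration, not the actual module design.
public class NanoHadoopSketch {

  // Stand-in for org.apache.hadoop.fs.Path: little more than a
  // wrapped location string.
  static class Path {
    private final String location;
    Path(String location) { this.location = location; }
    @Override public String toString() { return location; }
  }

  // Stand-in for org.apache.hadoop.conf.Configuration: a string-to-string
  // property map with the get/set calls ORC consults for reader options.
  static class Configuration {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key, String defaultValue) {
      return props.getOrDefault(key, defaultValue);
    }
  }

  public static void main(String[] args) {
    // Same call shapes as the Hadoop-based snippet, minus Hadoop.
    Configuration conf = new Configuration();
    conf.set("orc.compress", "ZSTD");
    Path path = new Path("my-file.orc");
    System.out.println(path + " -> " + conf.get("orc.compress", "NONE"));
  }
}
```

With classes like these on the classpath instead of the hadoop jars, the existing `OrcFile.createReader(new Path(...), OrcFile.readerOptions(conf))` call shape could compile unchanged.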
> Add a reader/writer that does not depend on Hadoop FileSystem
> -------------------------------------------------------------
>
> Key: ORC-508
> URL: https://issues.apache.org/jira/browse/ORC-508
> Project: ORC
> Issue Type: Improvement
> Components: Java
> Reporter: Ismaël Mejía
> Priority: Major
>
> It seems that the default implementation classes of ORC today depend on
> Hadoop FS objects to write. This is not ideal for APIs that do not rely on
> Hadoop. For some context, I was looking at adding support for Apache
> Beam, but Beam's API supports multiple filesystems with a more generic
> abstraction that relies on Java's Channels and Streams APIs and delegates
> directly to distributed filesystems, e.g. Google Cloud Storage, Amazon S3,
> etc. It would be really nice to have such support in the core
> implementation, and maybe to split the Hadoop-dependent implementation into
> its own module in the future.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)