[jira] [Commented] (ORC-508) Add a reader/writer that does not depend on Hadoop FileSystem

Owen O'Malley (Jira) Fri, 22 Jan 2021 10:34:05 -0800


    [ https://issues.apache.org/jira/browse/ORC-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270349#comment-17270349 ]


Owen O'Malley commented on ORC-508:
-----------------------------------

Ok, I'm going to take a look at this today. At a high level, here is how many
references we currently have to each of the Hadoop classes:

  46 import org.apache.hadoop.fs.Path;
  46 import org.apache.hadoop.conf.Configuration;
  39 import org.apache.hadoop.fs.FileSystem;
  12 import org.apache.hadoop.fs.FSDataInputStream;
  10 import org.apache.hadoop.io.Text;
   6 import org.apache.hadoop.io.BytesWritable;
   5 import org.apache.hadoop.fs.FileStatus;
   5 import org.apache.hadoop.fs.FSDataOutputStream;
   3 import org.apache.hadoop.util.Progressable;
   3 import org.apache.hadoop.fs.permission.FsPermission;
   2 import org.apache.hadoop.io.DataOutputBuffer;
   2 import org.apache.hadoop.fs.Seekable;
   2 import org.apache.hadoop.fs.PositionedReadable;
   1 import org.apache.hadoop.util.VersionInfo;
   1 import org.apache.hadoop.io.WritableComparator;
   1 import org.apache.hadoop.io.IntWritable;
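
To make the read side of the list above concrete, here is a rough sketch of what a Hadoop-free stand-in for the positioned reads behind FSDataInputStream and PositionedReadable might look like on top of plain java.nio. This is an assumption for illustration only: PositionedInput, ChannelPositionedInput, ChannelInputDemo and readTail are hypothetical names, not existing ORC classes and not a proposed design.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: every name below is hypothetical, none of it is ORC API.
class ChannelInputDemo {

  /** Hypothetical Hadoop-free abstraction for positioned reads. */
  interface PositionedInput extends AutoCloseable {
    long length() throws IOException;

    void readFully(long position, byte[] buffer, int offset, int length)
        throws IOException;

    @Override
    void close() throws IOException;
  }

  /** Backs PositionedInput with a plain java.nio FileChannel. */
  static final class ChannelPositionedInput implements PositionedInput {
    private final FileChannel channel;

    ChannelPositionedInput(Path file) throws IOException {
      this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    @Override
    public long length() throws IOException {
      return channel.size();
    }

    @Override
    public void readFully(long position, byte[] buffer, int offset, int length)
        throws IOException {
      ByteBuffer target = ByteBuffer.wrap(buffer, offset, length);
      while (target.hasRemaining()) {
        // Bytes copied so far = target.position() - offset.
        long readAt = position + (target.position() - offset);
        if (channel.read(target, readAt) < 0) {
          throw new EOFException("Hit end of file at offset " + readAt);
        }
      }
    }

    @Override
    public void close() throws IOException {
      channel.close();
    }
  }

  /** Reads the last count bytes of a file (fewer if the file is shorter). */
  static byte[] readTail(Path file, int count) throws IOException {
    try (PositionedInput in = new ChannelPositionedInput(file)) {
      int size = (int) Math.min(count, in.length());
      byte[] tail = new byte[size];
      in.readFully(in.length() - size, tail, 0, size);
      return tail;
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("channel-input-demo", ".bin");
    try {
      Files.write(tmp, "hello ORC".getBytes(StandardCharsets.US_ASCII));
      System.out.println(new String(readTail(tmp, 3), StandardCharsets.US_ASCII));
    } finally {
      Files.delete(tmp);
    }
  }
}
```

The sketch only covers local files through FileChannel. Other stores would need their own implementation of the same hypothetical interface, and the write side (FSDataOutputStream) is not addressed here.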

> Add a reader/writer that does not depend on Hadoop FileSystem
> -------------------------------------------------------------
>
>                 Key: ORC-508
>                 URL: https://issues.apache.org/jira/browse/ORC-508
>             Project: ORC
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Ismaël Mejía
>            Priority: Major
>
> It seems that the default implementation classes of ORC today depend on
> Hadoop FS objects to write. This is not ideal for APIs that do not rely on
> Hadoop. For some context, I was taking a look at adding support for Apache
> Beam, but Beam's API supports multiple filesystems with a more generic
> abstraction that relies on Java's Channels and Streams APIs and delegates
> directly to distributed filesystems, e.g. Google Cloud Storage, Amazon S3,
> etc. It would be really nice to have such support in the core
> implementation, and maybe to split the Hadoop-dependent implementation into
> its own module in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
