[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992322#comment-15992322 ] ASF GitHub Bot commented on BEAM-2005:

Github user asfgit closed the pull request at: https://github.com/apache/beam/pull/2777

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Stephen Sisk
> Assignee: Luke Cwik
> Fix For: First stable release
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989699#comment-15989699 ] ASF GitHub Bot commented on BEAM-2005:

GitHub user lukecwik opened a pull request:

https://github.com/apache/beam/pull/2777

[BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HadoopFileSystem.

TODO:
* Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
* Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.

Be sure to do all of the following to help us incorporate your contribution quickly and easily:
- [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- [x] Replace `` in the title with the actual Jira issue number, if there is one.
- [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
---

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lukecwik/incubator-beam hdfs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2777.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2777

commit c792da69467ccb0a486fe74ff7df0e5e9cb54abf
Author: Luke Cwik
Date: 2017-04-29T02:07:59Z

[BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HadoopFileSystem.

TODO:
* Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
* Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.
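
As a rough illustration of the first TODO item (routing requests by base URI when several file systems are configured), here is a minimal sketch. It is not taken from the pull request: `BaseUriRouter`, `baseOf`, `register` and `route` are hypothetical names, and treating "scheme plus authority" as the base URI is an assumption made for the example.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: none of these names come from Beam or Hadoop.
final class BaseUriRouter<T> {
  // Maps a normalized base such as "hdfs://namenode-a:8020" to its backend.
  private final Map<String, T> backends = new LinkedHashMap<>();

  // Reduces a URI to lower-cased "scheme://authority"; a missing authority becomes "".
  static String baseOf(URI uri) {
    if (uri.getScheme() == null) {
      throw new IllegalArgumentException("URI has no scheme: " + uri);
    }
    String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
    return (uri.getScheme() + "://" + authority).toLowerCase(Locale.ROOT);
  }

  // Registers one backend per base URI and rejects duplicates.
  void register(String baseUri, T backend) {
    Objects.requireNonNull(backend, "backend");
    String key = baseOf(URI.create(baseUri));
    if (backends.putIfAbsent(key, backend) != null) {
      throw new IllegalStateException("Duplicate base URI: " + key);
    }
  }

  // Picks the backend whose base URI matches the resource's scheme and authority.
  T route(String resource) {
    String key = baseOf(URI.create(resource));
    T backend = backends.get(key);
    if (backend == null) {
      throw new IllegalArgumentException("No file system configured for " + key);
    }
    return backend;
  }
}
```

With two clusters registered under `hdfs://namenode-a:8020` and `hdfs://namenode-b:8020`, a call such as `route("hdfs://namenode-b:8020/data/part-0")` would return the second backend. Exact matching on scheme and authority keeps the lookup simple; a real implementation might need prefix matching on paths as well.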
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989688#comment-15989688 ] ASF GitHub Bot commented on BEAM-2005:

Github user lukecwik closed the pull request at: https://github.com/apache/beam/pull/2776
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989683#comment-15989683 ] ASF GitHub Bot commented on BEAM-2005:

GitHub user lukecwik opened a pull request:

https://github.com/apache/beam/pull/2776

[BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HadoopFileSystem.

TODO:
* Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
* Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.

Be sure to do all of the following to help us incorporate your contribution quickly and easily:
- [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request`
- [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- [x] Replace `` in the title with the actual Jira issue number, if there is one.
- [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
---

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lukecwik/incubator-beam hdfs2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2776.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2776

commit ecfb534c79cb38bdf378dffe25d6a7ee3e20e5c6
Author: Luke Cwik
Date: 2017-04-29T01:37:03Z

[BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HDFS.

TODO:
* Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
* Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989625#comment-15989625 ] ASF GitHub Bot commented on BEAM-2005:

Github user asfgit closed the pull request at: https://github.com/apache/beam/pull/2772
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989592#comment-15989592 ] ASF GitHub Bot commented on BEAM-2005:

GitHub user lukecwik opened a pull request:

https://github.com/apache/beam/pull/2772

[BEAM-2005] Swap to use Lists within MatchResult instead of arrays.

Be sure to do all of the following to help us incorporate your contribution quickly and easily:
- [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- [ ] Replace `` in the title with the actual Jira issue number, if there is one.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
---

Everywhere in our codebase we go through and want to have lists as the return type. This also AutoValue's them.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lukecwik/incubator-beam hdfs2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2772.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2772

commit f9fe921cb64a19a092590644e1d94a4b202db3c9
Author: Luke Cwik
Date: 2017-04-28T23:27:53Z

[BEAM-2005] Swap to use Lists within MatchResult instead of arrays.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989245#comment-15989245 ] ASF GitHub Bot commented on BEAM-2005:

Github user tgroh closed the pull request at: https://github.com/apache/beam/pull/2763
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989203#comment-15989203 ] ASF GitHub Bot commented on BEAM-2005:

GitHub user tgroh opened a pull request:

https://github.com/apache/beam/pull/2763

Revert "[BEAM-2005] Move getScheme from FileSystemRegistrar to FileSy…

…stem"

This reverts commit ce88c88b14e963ac17fac83dd19495c835c1f6cb.

Be sure to do all of the following to help us incorporate your contribution quickly and easily:
- [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- [ ] Replace `` in the title with the actual Jira issue number, if there is one.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
---

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgroh/beam revert_getscheme_fs_registrar

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2763.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2763

commit 25f9b4a4b1001549e15989eb4d94b6c2978872ac
Author: Thomas Groh
Date: 2017-04-28T17:30:15Z

Revert "[BEAM-2005] Move getScheme from FileSystemRegistrar to FileSystem"

This reverts commit ce88c88b14e963ac17fac83dd19495c835c1f6cb.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989071#comment-15989071 ] ASF GitHub Bot commented on BEAM-2005:

Github user asfgit closed the pull request at: https://github.com/apache/beam/pull/2757

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
> Fix For: First stable release
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987992#comment-15987992 ] ASF GitHub Bot commented on BEAM-2005:

GitHub user lukecwik opened a pull request:

https://github.com/apache/beam/pull/2757

[BEAM-2005] Move getScheme from FileSystemRegistrar to FileSystem

Note that I needed to update FileSystems to instantiate the FileSystem(s) upfront instead of remembering the mapping from scheme to registrar during "registration" of the PipelineOptions.

Be sure to do all of the following to help us incorporate your contribution quickly and easily:
- [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- [ ] Replace `` in the title with the actual Jira issue number, if there is one.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
---

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lukecwik/incubator-beam hdfs2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2757.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2757

commit a2ccc739f7dc2e6f606fafd6e158b63efe190b10
Author: Luke Cwik
Date: 2017-04-28T01:13:43Z

[BEAM-2005] Move getScheme from FileSystemRegistrar to FileSystem

Note that I needed to update FileSystems to instantiate the FileSystem(s) upfront instead of remembering the mapping from scheme to registrar.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982303#comment-15982303 ] Jean-Baptiste Onofré commented on BEAM-2005:

+1 (sorry I'm late on this, I was busy with CassandraIO and Spark 2 support). Gonna help you in the coming days.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982174#comment-15982174 ] Stephen Sisk commented on BEAM-2005:

Also: all the sub-tasks on this are assigned to me since we don't want FSR items unassigned, but I'm currently working on only one of them. Let me know if you're interested in working on this with me.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982100#comment-15982100 ] Stephen Sisk commented on BEAM-2005:

Also, I filed https://issues.apache.org/jira/browse/BEAM-2069 (Remove FileSystem.getCurrentDirectory()?) based on an issue I ran into while implementing HadoopResourceId. For now, I'm having getCurrentDirectory() throw NotImplementedException()
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982030#comment-15982030 ] Stephen Sisk commented on BEAM-2005:

Note that adding support for other Hadoop FileSystems should be simple (I'm not doing anything to block them) and might be as straightforward as adding another FileSystemRegistrar that returns the HadoopFileSystem for different schemes.
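
To make the "another registrar for different schemes" idea concrete, here is a minimal sketch. It does not use Beam's actual `FileSystem` or `FileSystemRegistrar` types, whose signatures differ; `SketchFileSystem`, `SketchRegistrar`, `HadoopBackedRegistrar` and `SchemeTable` are hypothetical stand-ins, and the scheme names used in the usage note are only examples.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Beam-style file system.
interface SketchFileSystem {
  String describe();
}

// Hypothetical stand-in for a registrar: it names its schemes and builds one file system.
interface SketchRegistrar {
  List<String> schemes();

  SketchFileSystem create();
}

// A registrar that exposes the same Hadoop-backed file system under any schemes it is given.
final class HadoopBackedRegistrar implements SketchRegistrar {
  private final List<String> schemes;

  HadoopBackedRegistrar(String... schemes) {
    this.schemes = Arrays.asList(schemes);
  }

  @Override
  public List<String> schemes() {
    return schemes;
  }

  @Override
  public SketchFileSystem create() {
    return () -> "hadoop-backed";
  }
}

final class SchemeTable {
  // Instantiates every file system upfront and indexes it by scheme.
  static Map<String, SketchFileSystem> build(List<? extends SketchRegistrar> registrars) {
    Map<String, SketchFileSystem> table = new HashMap<>();
    for (SketchRegistrar registrar : registrars) {
      SketchFileSystem fs = registrar.create(); // one instance shared by all of its schemes
      for (String scheme : registrar.schemes()) {
        if (table.putIfAbsent(scheme, fs) != null) {
          throw new IllegalStateException("Scheme registered twice: " + scheme);
        }
      }
    }
    return table;
  }
}
```

Building the table from `new HadoopBackedRegistrar("hdfs")` plus `new HadoopBackedRegistrar("s3a", "wasb")` yields three scheme entries, with the last two sharing one instance. Rejecting duplicate schemes at build time is a design choice for the sketch, not something stated in this thread.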
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981966#comment-15981966 ] Stephen Sisk commented on BEAM-2005:

Since I haven't heard anything and we are trying to finish soon for FSR, I'm doing as follows:
1. Implement a simple version of FileSystem that supports only HDFS and only one configuration, and that gets its configuration via PipelineOptions in a form that looks like a JSON blob.
2. It will be checked in at the same location as the existing HadoopFileSystem.java (and thus, this will live in extensions).

I like the idea of the scheme -> module mapping, but that probably belongs in another JIRA if we want it.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977537#comment-15977537 ] Stephen Sisk commented on BEAM-2005:

There are a couple of possible ways we could adapt to this:
* Potentially we could give different connections different schemes, but that falls apart if someone wants to use URIs generated elsewhere.
* Start passing in the FileSystem object as an option on the read transform (like Hadoop does). This also incidentally solves the "how will people know if hdfs is in the set of modules loaded on your system" problem that was discussed above: they'll need to instantiate the instance themselves, and they'll go through their normal discovery mechanism for doing so.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976952#comment-15976952 ] Stephen Sisk commented on BEAM-2005:

I don't want to derail this conversation, but I did have a couple other concerns - Beam's FileSystem has a copy() command, however I can't find a good analog in Hadoop's FileSystem. https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html shows lots of copy to/from local files, but no "copy between these two arbitrary paths". I also believe that since Beam FileSystem objects are configured via PipelineOptions, we need to pass a Hadoop Configuration through PipelineOptions. I think that's very solvable, but it does seem semi-complicated. I'm going to open subtasks for discussion so we can discuss in separate threads.
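
One fallback for a missing "copy between two arbitrary paths" call is to read the source and write the destination through the local machine. The sketch below shows only that generic stream loop; `StreamCopySketch` is a hypothetical name, and wiring it to Hadoop (for example with the streams returned by `FileSystem.open(Path)` and `FileSystem.create(Path)`) is an assumption about how one might use it, not something this thread confirms was done.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: copies bytes from any input stream to any output stream.
final class StreamCopySketch {
  // Pulls every byte from src and pushes it to dst through a local buffer.
  // Returns the number of bytes copied; the caller is responsible for closing both streams.
  static long copy(InputStream src, OutputStream dst) throws IOException {
    byte[] buffer = new byte[64 * 1024];
    long total = 0;
    int read;
    while ((read = src.read(buffer)) != -1) {
      dst.write(buffer, 0, read);
      total += read;
    }
    dst.flush();
    return total;
  }
}
```

The cost of this approach is that every byte crosses the network twice and passes through the machine running the copy, which is the inefficiency a server-side copy or rename would avoid.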
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976046#comment-15976046 ] Aviem Zur commented on BEAM-2005:

Regarding {{core}} vs {{extensions}}. This can reside in a separate module from core, but I think that core should depend on it so users get this functionality out of the box. For example, if a user uses {{TextIO}} and it works for them when passing {{"file://path/to/file"}}, changing this to {{"hdfs://path/to/file"}} should work without the need to add a new dependency to their project.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975557#comment-15975557 ] Stephen Sisk commented on BEAM-2005:

[~aviemzur] correct, that's what we'll need to do. [~jbonofre] I've also started working on this - mind sharing your code? Maybe in a feature branch of the beam repo where we could collaborate?
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975196#comment-15975196 ] Aviem Zur commented on BEAM-2005:

Wouldn't this ticket mean actually implementing https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java ?
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975193#comment-15975193 ] Aviem Zur commented on BEAM-2005: - I'll refine - it should be in a separate module, but I think core should depend on it (i.e. out of the box functionality). > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975164#comment-15975164 ] Eugene Kirpichov commented on BEAM-2005: After this is done, and before the first stable release, we should delete https://github.com/apache/beam/tree/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975001#comment-15975001 ] Daniel Halperin commented on BEAM-2005: --- `core` vs `extensions` -- this won't be in `sdk-java-core` itself, it will probably be in `sdk-java-extensions-hadoop` or whatever (just like `GcsFileSystem` either is moving or has moved to the new `sdks-java-extensions-gcp-core`). I could also see `sdk-java-io-hadoop`, but I think this is a reasonable-ish use of `core`. Our JIRA tags are not perfect. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974991#comment-15974991 ] Jean-Baptiste Onofré commented on BEAM-2005: I don't mind whether it goes in core or extensions (it could make sense to have a filesystem extension). I fully agree that it's a must-have for the first stable release. That's why I started focusing on this last weekend. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974980#comment-15974980 ] Aviem Zur commented on BEAM-2005: - I think this should be in `core` not in `extensions`. Also, I think this should be a blocker for first stable release. WDYT? > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Affects Versions: First stable release >Reporter: Stephen Sisk >Assignee: Stephen Sisk > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974783#comment-15974783 ] Jean-Baptiste Onofré commented on BEAM-2005: I'm thinking about something like: {code} TextIO.from("hdfs://credentialTag@namenode/path/to/file") {code} [~sisk] [~dhalp...@google.com] Thoughts? > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Affects Versions: First stable release >Reporter: Stephen Sisk >Assignee: Stephen Sisk > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
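To make the proposed file spec concrete, here is a minimal sketch of how a string such as hdfs://credentialTag@namenode/path/to/file could be split into its parts using only java.net.URI. The class name HadoopFileSpec and its fields are hypothetical and are not part of Beam or Hadoop; treating the user-info part of the URI as the "credentialTag" is an assumption drawn from the example above, not a confirmed design.

```java
import java.net.URI;

// Hypothetical helper (not a Beam or Hadoop class): splits a spec of the
// proposed form scheme://[credentialTag@]namenode/path into its components.
final class HadoopFileSpec {
    final String scheme;
    final String credentialTag; // null when the spec carries no user-info part
    final String nameNode;
    final String path;

    private HadoopFileSpec(String scheme, String credentialTag, String nameNode, String path) {
        this.scheme = scheme;
        this.credentialTag = credentialTag;
        this.nameNode = nameNode;
        this.path = path;
    }

    static HadoopFileSpec parse(String spec) {
        URI uri = URI.create(spec);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException(
                "Expected scheme://[tag@]host/path but got: " + spec);
        }
        // Assumption: the URI user-info slot is what carries the credential tag.
        return new HadoopFileSpec(
            uri.getScheme(), uri.getUserInfo(), uri.getHost(), uri.getPath());
    }

    public static void main(String[] args) {
        HadoopFileSpec spec = parse("hdfs://credentialTag@namenode/path/to/file");
        System.out.println(spec.scheme + " | " + spec.credentialTag
            + " | " + spec.nameNode + " | " + spec.path);
    }
}
```

In this sketch the tag is only a lookup key: how it would map to actual credentials (for example a Kerberos principal and keytab) is left open, as it is in the thread.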
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974779#comment-15974779 ] Jean-Baptiste Onofré commented on BEAM-2005: By the way, an important part of HDFS is Kerberos support. I think it would require a couple of additional methods (as we have for Google Storage) around Kerberos authentication. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Affects Versions: First stable release >Reporter: Stephen Sisk >Assignee: Stephen Sisk > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974120#comment-15974120 ] Jean-Baptiste Onofré commented on BEAM-2005: Fully agree. I started on this plus S3 and Azure (as there are some slight differences). I'm also experimenting with the MongoDB GridFS filesystem. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Affects Versions: First stable release >Reporter: Stephen Sisk >Assignee: Stephen Sisk > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)