[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO
[ https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540382#comment-15540382 ]

ASF GitHub Bot commented on BEAM-674:

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-beam/pull/1025

> Add GridFS support to MongoDB IO
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Daniel Kulp
> Fix For: 0.3.0-incubating
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO
[ https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533267#comment-15533267 ]

ASF GitHub Bot commented on BEAM-674:

GitHub user dkulp opened a pull request:

    https://github.com/apache/incubator-beam/pull/1025

    [BEAM-674] Gridfs Source refactoring

    Refactor of the GridFS based Source, based on feedback from @jkff.

    BoundedSource is now a source of ObjectIds, and a separate DoFn is used
    to convert/parse the GridFSDBFile into usable chunks.
    Test case for splitting added.
    Variables not needed by the Source are pulled out and placed on the
    transform instead.
    Optimized the non-split case a bit by not querying all the ObjectIds up
    front.
    Optimized unit tests by setting up test data per class instead of per
    test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dkulp/incubator-beam gridfs-t2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/1025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1025

commit 5aad971bcd1d32ba06cec9d4870e7aa9e9dc17f5
Author: Daniel Kulp
Date: 2016-09-29T02:44:37Z

    Split BoundedSource into a BoundedSource and a DoFn<...>

commit 2fc219cdd33e89d65d457dd3767bd378ffc0
Author: Daniel Kulp
Date: 2016-09-29T13:03:31Z

    Optimize reading for non-split case

commit e58fc61868988cc40c325d913fca37b26e3db99c
Author: Daniel Kulp
Date: 2016-09-29T13:18:17Z

    Use objectId timestamp

commit ed73d77b21651d6ef1d8cf2892dc267794d52d10
Author: Daniel Kulp
Date: 2016-09-29T13:57:44Z

    Pull parser out of BoundedSource, add maxSkew

commit 277667527cf0a23704b3ae3d05b2c8e2c2bcea3c
Author: Daniel Kulp
Date: 2016-09-29T14:48:42Z

    Add test case for the split

commit db30aabac4629ae167e4ede73de79257b4a93336
Author: Daniel Kulp
Date: 2016-09-29T15:00:44Z

    Don't need the generic on the Source and Reader

commit 1cdb2ce716b7e020c5306494b414b5bb136abb24
Author: Daniel Kulp
Date: 2016-09-29T16:29:51Z

    Rename maxSkew to allowedTimestampSkew to match other DoFn's

> Add GridFS support to MongoDB IO
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Daniel Kulp
> Fix For: 0.3.0-incubating
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.
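Editorial sketch (not code from the pull request): the refactoring described above has the source emit only file ids, grouped into bundles, with parsing left to a later step. The plain-Java example below illustrates one way such grouping could work, and how a timestamp can be read from an ObjectId. The class `GridFsSplitSketch`, the `FileRef` type, the greedy size-based grouping and the `desiredBundleSizeBytes` parameter are all assumptions made for illustration; the only fact relied on is that the first four bytes of a MongoDB ObjectId hold big-endian seconds since the Unix epoch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these names come from the pull request. */
class GridFsSplitSketch {

  /** Stand-in for a GridFS file entry: an id string plus the file length in bytes. */
  static final class FileRef {
    final String objectIdHex;
    final long lengthBytes;

    FileRef(String objectIdHex, long lengthBytes) {
      this.objectIdHex = objectIdHex;
      this.lengthBytes = lengthBytes;
    }
  }

  /**
   * Greedily groups file ids into bundles. A bundle is closed as soon as adding
   * the next file would push it past desiredBundleSizeBytes, so a single
   * oversized file ends up in a bundle of its own.
   */
  static List<List<String>> splitIntoBundles(List<FileRef> files, long desiredBundleSizeBytes) {
    List<List<String>> bundles = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentSize = 0;
    for (FileRef file : files) {
      if (!current.isEmpty() && currentSize + file.lengthBytes > desiredBundleSizeBytes) {
        bundles.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(file.objectIdHex);
      currentSize += file.lengthBytes;
    }
    if (!current.isEmpty()) {
      bundles.add(current);
    }
    return bundles;
  }

  /** Reads the creation time from the first 4 bytes (8 hex digits) of an ObjectId. */
  static long objectIdEpochSeconds(String objectIdHex) {
    return Long.parseLong(objectIdHex.substring(0, 8), 16);
  }

  public static void main(String[] args) {
    List<FileRef> files = Arrays.asList(
        new FileRef("57ed2a000000000000000001", 40),
        new FileRef("57ed2a000000000000000002", 40),
        new FileRef("57ed2a000000000000000003", 40),
        new FileRef("57ed2a000000000000000004", 200));
    System.out.println(splitIntoBundles(files, 100));
    System.out.println(objectIdEpochSeconds(files.get(0).objectIdHex));
  }
}
```

Because each bundle carries only ids, a downstream step can fetch and parse every file independently of how the ids were grouped.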
[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO
[ https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530050#comment-15530050 ]

ASF GitHub Bot commented on BEAM-674:

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-beam/pull/1003

> Add GridFS support to MongoDB IO
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Daniel Kulp
> Fix For: 0.3.0-incubating
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.
[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO
[ https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523335#comment-15523335 ]

ASF GitHub Bot commented on BEAM-674:

GitHub user dkulp opened a pull request:

    https://github.com/apache/incubator-beam/pull/1003

    [BEAM-674] Source part of GridFS IO

    This is the "Source" part of the GridFS based IO for Beam. (I will work
    on the Sink next, but would like to get this reviewed and merged first.)

    The "default" is to parse each file as a text file (line by line), but a
    parser function can be provided to take the InputStream and parse it in
    whatever way is required.

    For runners that can split into bundles, it attempts to assign files in
    the grid to different bundles.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dkulp/incubator-beam gridfs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/1003.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1003

commit d5cdc2429622f65a762774de8b5baf15334e55e2
Author: Daniel Kulp
Date: 2016-09-16T20:58:56Z

    Add GridFS io

commit a9212662744c14f10cd811540c3e9268c32c25c4
Author: Daniel Kulp
Date: 2016-09-16T21:19:50Z

    Fix checkstyle issues

commit cee0a06b6a465a276c2c5410d7d3f9af703982d4
Author: Daniel Kulp
Date: 2016-09-19T17:22:44Z

    Attempt to get a converter in there

commit fafa8fa607f22eacf918abb13419f28df9d2a8e9
Author: Daniel Kulp
Date: 2016-09-19T17:32:44Z

    Fix javac compile problem

commit 7e9872f12c74902f1a23e5a27eb0027ae753947a
Author: Daniel Kulp
Date: 2016-09-19T17:50:11Z

    Force a serializable

commit 265747946864b226235ee5b758e6c10b7cc3992f
Author: Daniel Kulp
Date: 2016-09-19T17:56:03Z

    Add the needed coder

commit 4f54495afe7ff4768d873350c345d39905d812fc
Author: Daniel Kulp
Date: 2016-09-19T18:02:39Z

    Change to using the GridFSDBFile instead of InputStream so the parsingFn can have access to all the metadata

commit cbeebf02542a5e5a5f4b9a6c370b1b68b46d2deb
Author: Daniel Kulp
Date: 2016-09-19T18:25:23Z

    Flip to allowing the parser to have complete control over how the item is added to the collection

commit a08007b9f444fedcde78ab38c6cdf505b3864c61
Author: Daniel Kulp
Date: 2016-09-19T18:26:19Z

    Fix unused imports

commit a4840e98d891d3fa783654a472af06c4d399a929
Author: Daniel Kulp
Date: 2016-09-21T19:51:45Z

    Add test for the parser functionality and cleanup some of that code

commit 438a792a796be77186d79aa3fdb221efcced6d4f
Author: Daniel Kulp
Date: 2016-09-21T20:01:33Z

    Move the coder out from the parser

commit e8fcdbf3cebd6fa4648f328484dee07fec35b21a
Author: Daniel Kulp
Date: 2016-09-22T12:36:50Z

    Fix test

commit 1d1a373fc7cec4e78bf0e618a902c15005fc36b4
Author: Daniel Kulp
Date: 2016-09-23T14:49:53Z

    Flip to using BoundedSource so it can be broken up into bundles

> Add GridFS support to MongoDB IO
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Daniel Kulp
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.
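Editorial sketch (not code from the pull request): the description above mentions a default that reads each file line by line, with an optional parser function that takes over the stream. The plain-Java example below shows one possible shape for such a pluggable parser. The `Parser` interface, the `LINES` default and the `readAll` helper are hypothetical names, and a byte array stands in for the stored file so the example runs without MongoDB or Beam on the classpath.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration only; none of these names come from the pull request. */
class GridFsParserSketch {

  /** Parser contract: read the stream and hand each parsed element to the output callback. */
  interface Parser<T> {
    void parse(InputStream in, Consumer<T> output) throws IOException;
  }

  /** Default behaviour assumed here: treat the file as UTF-8 text and emit one element per line. */
  static final Parser<String> LINES = (in, output) -> {
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    String line;
    while ((line = reader.readLine()) != null) {
      output.accept(line);
    }
  };

  /** Runs a parser over in-memory file contents and collects everything it emits. */
  static <T> List<T> readAll(byte[] fileContents, Parser<T> parser) {
    List<T> collected = new ArrayList<>();
    try (InputStream in = new ByteArrayInputStream(fileContents)) {
      parser.parse(in, collected::add);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return collected;
  }

  public static void main(String[] args) {
    byte[] contents = "ab\ncde\n".getBytes(StandardCharsets.UTF_8);

    // Default: one String per line.
    System.out.println(readAll(contents, LINES));

    // Custom parser: reuse the line reader but emit each line's length instead.
    Parser<Integer> lineLengths =
        (in, output) -> LINES.parse(in, line -> output.accept(line.length()));
    System.out.println(readAll(contents, lineLengths));
  }
}
```

Passing the output callback to the parser, rather than having it return a value, lets a parser emit zero, one or many elements per file, which is in the spirit of the later commit that gives the parser control over how items are added.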
[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO
[ https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523309#comment-15523309 ]

Jean-Baptiste Onofré commented on BEAM-674:

Ready to review.

> Add GridFS support to MongoDB IO
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Jean-Baptiste Onofré
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.