[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO

2016-10-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540382#comment-15540382
 ] 

ASF GitHub Bot commented on BEAM-674:
-------------------------------------

GitHub user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1025


> Add GridFS support to MongoDB IO
> --------------------------------
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Daniel Kulp
> Fix For: 0.3.0-incubating
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO

2016-09-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533267#comment-15533267
 ] 

ASF GitHub Bot commented on BEAM-674:
-------------------------------------

GitHub user dkulp opened a pull request:

https://github.com/apache/incubator-beam/pull/1025

[BEAM-674] Gridfs Source refactoring

Refactor of the GridFS-based Source, based on feedback from @jkff.

The BoundedSource is now a source of ObjectIds, and a separate DoFn is used to
convert/parse each GridFSDBFile into usable chunks.
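
The split described above (one stage that only lists ids, and a separate
per-element step that opens and parses each file) can be illustrated with a
minimal plain-Java sketch. It uses no Beam or MongoDB classes: the in-memory
map standing in for a GridFS bucket, the String ids, and the names
IdThenParseSketch, listIds, parse and readAll are all hypothetical and are
not taken from the pull request.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a map of id -> bytes stands in for a GridFS bucket. */
public class IdThenParseSketch {

    /** Stage 1 (the "source"): emit ids only, without opening any file. */
    public static List<String> listIds(Map<String, byte[]> grid) {
        return new ArrayList<>(grid.keySet());
    }

    /** Stage 2 (the separate per-element step): load one file and parse it into lines. */
    public static List<String> parse(Map<String, byte[]> grid, String id) {
        // Assumption: file contents are UTF-8 text.
        String text = new String(grid.get(id), StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                out.add(line);
            }
        }
        return out;
    }

    /** Chains both stages in order, as would happen after any splitting of the id list. */
    public static List<String> readAll(Map<String, byte[]> grid) {
        List<String> out = new ArrayList<>();
        for (String id : listIds(grid)) {
            out.addAll(parse(grid, id));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, byte[]> grid = new LinkedHashMap<>();
        grid.put("id-1", "a\nb\n".getBytes(StandardCharsets.UTF_8));
        grid.put("id-2", "c\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(grid)); // prints [a, b, c]
    }
}
```

Because stage 1 handles only small ids, the id list is cheap to split and
serialize, and the cost of reading file contents is deferred to stage 2.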

Added a test case for splitting.

Variables not needed by the Source are pulled out and moved to the
transform instead.

Optimized the non-split case a bit by not querying all the ObjectIds up
front.

Optimized the unit tests by setting up test data once per class instead of
per test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dkulp/incubator-beam gridfs-t2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1025


commit 5aad971bcd1d32ba06cec9d4870e7aa9e9dc17f5
Author: Daniel Kulp 
Date:   2016-09-29T02:44:37Z

Split BoundedSource into a BoundedSource and a DoFn<...>

commit 2fc219cdd33e89d65d457dd3767bd378ffc0
Author: Daniel Kulp 
Date:   2016-09-29T13:03:31Z

Optimize reading for non-split case

commit e58fc61868988cc40c325d913fca37b26e3db99c
Author: Daniel Kulp 
Date:   2016-09-29T13:18:17Z

Use objectId timestamp

commit ed73d77b21651d6ef1d8cf2892dc267794d52d10
Author: Daniel Kulp 
Date:   2016-09-29T13:57:44Z

Pull parser out of BoundedSource, add maxSkew

commit 277667527cf0a23704b3ae3d05b2c8e2c2bcea3c
Author: Daniel Kulp 
Date:   2016-09-29T14:48:42Z

Add test case for the split

commit db30aabac4629ae167e4ede73de79257b4a93336
Author: Daniel Kulp 
Date:   2016-09-29T15:00:44Z

Don't need the generic on the Source and Reader

commit 1cdb2ce716b7e020c5306494b414b5bb136abb24
Author: Daniel Kulp 
Date:   2016-09-29T16:29:51Z

Rename maxSkew to allowedTimestampSkew to match other DoFns






[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO

2016-09-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530050#comment-15530050
 ] 

ASF GitHub Bot commented on BEAM-674:
-------------------------------------

GitHub user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1003




[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523335#comment-15523335
 ] 

ASF GitHub Bot commented on BEAM-674:
-------------------------------------

GitHub user dkulp opened a pull request:

https://github.com/apache/incubator-beam/pull/1003

[BEAM-674] Source part of GridFS IO

This is the "Source" part of the GridFS-based IO for Beam. (I will work on the
Sink next, but would like to get this reviewed and merged first.) By default,
each file is parsed as a text file, line by line, but a parser function can be
provided that takes the InputStream and parses it in whatever way is required.
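
As a rough illustration of the "default line parser, optional custom parser"
idea described above, here is a minimal plain-Java sketch. It is not the
code from the pull request: the names ParserSketch, LINE_PARSER and read are
hypothetical, the parser is modelled as a java.util.function.Function over
an InputStream, and UTF-8 text is assumed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ParserSketch {

    /** Default parser: one output element per line of text (UTF-8 assumed). */
    public static final Function<InputStream, List<String>> LINE_PARSER = in -> {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    };

    /** Reads with the default line parser. */
    public static List<String> read(InputStream in) {
        return read(in, LINE_PARSER);
    }

    /** Reads with a caller-supplied parser, which decides what the elements are. */
    public static <T> List<T> read(InputStream in, Function<InputStream, List<T>> parser) {
        return parser.apply(in);
    }

    public static void main(String[] args) {
        byte[] data = "1,2\n3,4\n".getBytes(StandardCharsets.UTF_8);

        // Default behaviour: one element per line.
        System.out.println(read(new ByteArrayInputStream(data))); // prints [1,2, 3,4]

        // Custom parser (hypothetical): one element per comma-separated field.
        Function<InputStream, List<String>> csvParser = in -> {
            List<String> fields = new ArrayList<>();
            for (String line : LINE_PARSER.apply(in)) {
                fields.addAll(Arrays.asList(line.split(",")));
            }
            return fields;
        };
        System.out.println(read(new ByteArrayInputStream(data), csvParser)); // prints [1, 2, 3, 4]
    }
}
```

Keeping the parser as a plain function leaves the reading code unaware of the
file format; only the supplied function changes between use cases.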

For runners that can split a source into bundles, it attempts to assign the
files in the grid to different bundles.
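
The description does not say how files are assigned to bundles. One possible
approach, shown here purely as an assumption, is a greedy pass that keeps
adding whole files to the current bundle until a target size would be
exceeded. The names BundleSketch, split and desiredBytes are hypothetical,
and the file sizes are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BundleSketch {

    /**
     * Groups file ids, in iteration order, into bundles of at most roughly
     * desiredBytes each. Files are never split: a file larger than the target
     * simply gets a bundle of its own.
     */
    public static List<List<String>> split(Map<String, Long> fileSizes, long desiredBytes) {
        List<List<String>> bundles = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            // Close the current bundle if adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + e.getValue() > desiredBytes) {
                bundles.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(e.getKey());
            currentBytes += e.getValue();
        }
        if (!current.isEmpty()) {
            bundles.add(current);
        }
        return bundles;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("a", 40L);
        sizes.put("b", 50L);
        sizes.put("c", 30L);
        sizes.put("d", 100L);
        System.out.println(split(sizes, 100L)); // prints [[a, b], [c], [d]]
    }
}
```

A greedy pass like this is simple and preserves file order, but it does not
balance bundle sizes; a real implementation may well choose differently.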


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dkulp/incubator-beam gridfs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1003.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1003


commit d5cdc2429622f65a762774de8b5baf15334e55e2
Author: Daniel Kulp 
Date:   2016-09-16T20:58:56Z

Add GridFS io

commit a9212662744c14f10cd811540c3e9268c32c25c4
Author: Daniel Kulp 
Date:   2016-09-16T21:19:50Z

Fix checkstyle issues

commit cee0a06b6a465a276c2c5410d7d3f9af703982d4
Author: Daniel Kulp 
Date:   2016-09-19T17:22:44Z

Attempt to get a converter in there

commit fafa8fa607f22eacf918abb13419f28df9d2a8e9
Author: Daniel Kulp 
Date:   2016-09-19T17:32:44Z

Fix javac compile problem

commit 7e9872f12c74902f1a23e5a27eb0027ae753947a
Author: Daniel Kulp 
Date:   2016-09-19T17:50:11Z

Force a serializable

commit 265747946864b226235ee5b758e6c10b7cc3992f
Author: Daniel Kulp 
Date:   2016-09-19T17:56:03Z

Add the needed coder

commit 4f54495afe7ff4768d873350c345d39905d812fc
Author: Daniel Kulp 
Date:   2016-09-19T18:02:39Z

Change to using the GridFSDBFile instead of InputStream so the parsingFn 
can have access to all the metadata

commit cbeebf02542a5e5a5f4b9a6c370b1b68b46d2deb
Author: Daniel Kulp 
Date:   2016-09-19T18:25:23Z

Flip to allowing the parser to have complete control over how the item is 
added to the collection

commit a08007b9f444fedcde78ab38c6cdf505b3864c61
Author: Daniel Kulp 
Date:   2016-09-19T18:26:19Z

Fix unused imports

commit a4840e98d891d3fa783654a472af06c4d399a929
Author: Daniel Kulp 
Date:   2016-09-21T19:51:45Z

Add test for the parser functionality and cleanup some of that code

commit 438a792a796be77186d79aa3fdb221efcced6d4f
Author: Daniel Kulp 
Date:   2016-09-21T20:01:33Z

Move the coder out from the parser

commit e8fcdbf3cebd6fa4648f328484dee07fec35b21a
Author: Daniel Kulp 
Date:   2016-09-22T12:36:50Z

Fix test

commit 1d1a373fc7cec4e78bf0e618a902c15005fc36b4
Author: Daniel Kulp 
Date:   2016-09-23T14:49:53Z

Flip to using BoundedSource so it can be broken up into bundles




> Add GridFS support to MongoDB IO
> --------------------------------
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Daniel Kulp
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.





[jira] [Commented] (BEAM-674) Add GridFS support to MongoDB IO

2016-09-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523309#comment-15523309
 ] 

Jean-Baptiste Onofré commented on BEAM-674:
-------------------------------------------

Ready to review.

> Add GridFS support to MongoDB IO
> --------------------------------
>
> Key: BEAM-674
> URL: https://issues.apache.org/jira/browse/BEAM-674
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Daniel Kulp
> Assignee: Jean-Baptiste Onofré
>
> MongoDB has an "extension" called GridFS that allows storing of very large
> "files" into the MongoDB database in a relatively efficient way. It would
> be good to add a GridFS API based IO to allow retrieving the data for
> processing.


