[ https://issues.apache.org/jira/browse/BEAM-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987433#comment-16987433 ]

Yichi Zhang edited comment on BEAM-8884 at 12/4/19 2:23 AM:
------------------------------------------------------------

Created another Jira https://issues.apache.org/jira/browse/BEAM-8886 as a 
follow-up to cover the failure scenario in a test.


was (Author: yichi):
Created another Jira  https://issues.apache.org/jira/browse/BEAM-8886 for a 
follow up to cover the failure scenario.

> Python MongoDBIO TypeError when splitting
> -----------------------------------------
>
>                 Key: BEAM-8884
>                 URL: https://issues.apache.org/jira/browse/BEAM-8884
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Yichi Zhang
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> From [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p1575350991134000]:
> I am trying to run a pipeline (defined with the Python SDK) on Dataflow that 
> uses beam.io.ReadFromMongoDB. When dealing with very small datasets (<10mb) 
> it runs fine, when trying to run it with slightly larger datasets (70mb), I 
> always get this error:
> {code:}
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
> See the stack trace below. Running it on a local machine works just fine. I 
> would highly appreciate any pointers as to what this could be.
> I hope this is the right channel to address this.
> {code:}
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 
> 649, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 218, in execute
>     self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 226, in _perform_source_split_considering_api_limits
>     desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 263, in _perform_source_split
>     for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/mongodbio.py", 
> line 174, in split
>     bundle_end = min(stop_position, split_key_id)
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
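For context on the traceback above: Python 3 refuses to order-compare unrelated types, and the failing line calls `min(stop_position, split_key_id)`, so the error means one operand reached `split()` as a plain dict while the other was a bson ObjectId. A minimal sketch reproducing this class of error (`FakeObjectId` is a hypothetical stand-in so the snippet runs without pymongo installed, and the guard at the end is only an illustration, not Beam's actual fix):

```python
from functools import total_ordering


# Hypothetical stand-in for bson.objectid.ObjectId so this sketch runs
# without pymongo installed: it only order-compares against its own type.
@total_ordering
class FakeObjectId:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, FakeObjectId) and self.value == other.value

    def __lt__(self, other):
        if not isinstance(other, FakeObjectId):
            return NotImplemented  # lets Python raise TypeError for '<'
        return self.value < other.value


split_key_id = FakeObjectId(5)
stop_position = {"$maxKey": 1}  # assumed shape: a raw dict, not an ObjectId

# Mirrors the failing line in mongodbio.py's split(): min() needs '<',
# and Python 3 refuses '<' between a dict and an incompatible object.
try:
    bundle_end = min(stop_position, split_key_id)
except TypeError:
    print("min() raised TypeError: operands are not mutually orderable")

# One possible guard (an illustration, not Beam's actual fix): fall back to
# the split key when the stop position is not a comparable ObjectId.
if isinstance(stop_position, FakeObjectId):
    bundle_end = min(stop_position, split_key_id)
else:
    bundle_end = split_key_id
print(type(bundle_end).__name__)  # FakeObjectId
```

In Python 2 such mixed comparisons silently succeeded with an arbitrary ordering, which is why this surfaces as a Python 3 TypeError rather than a wrong-but-quiet split.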



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
