[
https://issues.apache.org/jira/browse/BEAM-7081?focusedWorklogId=230072&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-230072
]
ASF GitHub Bot logged work on BEAM-7081:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Apr/19 11:34
Start Date: 19/Apr/19 11:34
Worklog Time Spent: 10m
Work Description: romanvanderkrogt commented on pull request #8359:
[BEAM-7081] MongoDbIO: produce correct ranges for splitkeys
URL: https://github.com/apache/beam/pull/8359
When there is only a single split key, `splitKeysToFilters` does not compute
the correct result. For example, if the split key is "_id: 56", only the range
filter "_id lower than or equal to 56" is produced. It should also include a
filter "_id greater than 56". If this happens, the resulting PCollection
includes only the data until the first split; the remainder is not included.
While producing the fix for this issue, I noticed that `splitKeysToMatch`
suffers from the same issue, and additionally, that the last split is also
produced incorrectly. If the splits are, say, 56, 109, and 256, it produces
<=56, 56--109 and >= 109, instead of <=56, 56--109, 109--256 and >=256. A fix
for both issues is included as well.
------------------------
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [x] [**Choose
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
Post-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
--- | --- | --- | --- | --- | --- | --- | ---
Go | [](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
| --- | --- | --- | --- | --- | ---
Java | [](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)
Python | [](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/)
| --- | [](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)
<br> [](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)
| --- | --- | ---
Pre-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
--- |Java | Python | Go | Website
--- | --- | --- | --- | ---
Non-portable | [](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
Portable | --- | [](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/)
| --- | ---
See
[.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md)
for trigger phrase, status and link of all Jenkins jobs.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 230072)
Time Spent: 10m
Remaining Estimate: 0h
> MongoDbIO.splitKeysToFilters returns incorrect filters with only one splitkey
> -----------------------------------------------------------------------------
>
> Key: BEAM-7081
> URL: https://issues.apache.org/jira/browse/BEAM-7081
> Project: Beam
> Issue Type: Bug
> Components: io-java-mongodb
> Reporter: Roman van der Krogt
> Priority: Critical
> Fix For: 2.13.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When there is only a single split key, splitKeysToFilters does not compute
> the correct result. For example, if the split key is "_id: 56", only the
> range filter "_id lower than or equal to 56" is produced. It should also
> include a filter "_id greater than 56". If this happens, the resulting
> PCollection includes only the data until the first split; the remainder is
> not included.
>
> This can be remedied with the following few lines:
>
> {{if (i == 0) {}}
> {{ // this is the first split in the list, the filter defines}}
> {{ // the range from the beginning up to this split}}
> {{ rangeFilter = String.format("\{ $and: [ {\"_id\":{$lte:%s}}",}}
> {{ getFilterString(idType, splitKey));}}
> {{ filters.add(formatFilter(rangeFilter, additionalFilter));}}
> {{{color:#f79232} {color}{color:#14892c}// If there is only one split, also
> generate a range from the split to the end{color}}}
> {{{color:#14892c} if ( splitKeys.size() == 1) {{color}}}
> {{{color:#14892c} rangeFilter = String.format("\{ $and: [
> {\"_id\":{$gt:%s}}",getFilterString(idType, splitKey));{color}}}
> {{{color:#14892c} filters.add(formatFilter(rangeFilter,
> additionalFilter));{color}}}
> {{{color:#14892c} }{color}}}
> {{}}}
>
> The corresponding test case in MongoDbIOTest should be updated to the
> following:
>
> {{@Test}}
> {{public void testSplitIntoFilters() throws Exception {}}
> {{ // A single split will result in two filters}}
> {{ ArrayList<Document> documents = new ArrayList<>();}}
> {{ documents.add(new Document("_id", 56));}}
> {{ List<String> filters =
> MongoDbIO.BoundedMongoDbSource.splitKeysToFilters(documents, null);}}
> {{ assertEquals(2, filters.size());}}
> {{ assertEquals("\{ $and: [ {\"_id\":{$lte:ObjectId(\"56\")}} ]}",
> filters.get(0));}}
> {{ assertEquals("\{ $and: [ {\"_id\":{$gt:ObjectId(\"56\")}} ]}",
> filters.get(1));}}
> {{ // Add two more splits; now we should have 4 filters}}
> {{ documents.add(new Document("_id", 109));}}
> {{ documents.add(new Document("_id", 256));}}
> {{ filters = MongoDbIO.BoundedMongoDbSource.splitKeysToFilters(documents,
> null);}}
> {{ assertEquals(4, filters.size());}}
> {{ assertEquals("\{ $and: [ {\"_id\":{$lte:ObjectId(\"56\")}} ]}",
> filters.get(0));}}
> {{ assertEquals("{ $and: [
> {\"_id\":({$gt:ObjectId(\"56\"),$lte:ObjectId(\"109\")}} ]}",}}
> {{ filters.get(1));}}
> {{ assertEquals("\{ $and: [
> {\"_id\":{$gt:ObjectId(\"109\"),$lte:ObjectId(\"256\")}} ]}",}}
> {{ filters.get(2));}}
> {{ assertEquals("\{ $and: [ {\"_id\":{$gt:ObjectId(\"256\")}} ]}",
> filters.get(3));}}
> {{}}}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)