[
https://issues.apache.org/jira/browse/BEAM-7450?focusedWorklogId=250494&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-250494
]
ASF GitHub Bot logged work on BEAM-7450:
----------------------------------------
Author: ASF GitHub Bot
Created on: 29/May/19 20:43
Start Date: 29/May/19 20:43
Worklog Time Spent: 10m
Work Description: jhalaria commented on pull request #8718: [BEAM-7450] -
Add an unbounded HcatalogIO reader using splittable pardo
URL: https://github.com/apache/beam/pull/8718
- This PR adds a new unbounded reader based on splittable pardo(s) to read
data from hcat.
- There are 4 main aspects of the PR:
1. PartitionPoller - responsible for polling for new partitions and passing
that along to the hcat reader
2. HCatRecordReader - This is pretty similar to the current read function
inside HcatalogIO. Only difference is that this reads data for a specific
partition generated by the poller.
3. User of the reader has to specify a function that will know how to
advance the watermark.
4. They would also need to provide a way to compare partitions. This is how
we would sort and know where in the list should we start reading from.
If the general idea here sounds good, there is some synergy between the
HCatRecordReader and the reader inside HCatalogIO. I can possibly reuse some of
the methods from that by moving them to a helper class.
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [ ] [**Choose
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
Post-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
--- | --- | --- | --- | --- | --- | --- | ---
Go | [](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
| --- | --- | --- | --- | --- | ---
Java | [](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)
Python | [](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/)
| --- | [](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)
<br> [](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)
| --- | --- | ---
Pre-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
--- |Java | Python | Go | Website
--- | --- | --- | --- | ---
Non-portable | [](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
Portable | --- | [](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/)
| --- | ---
See
[.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md)
for trigger phrase, status and link of all Jenkins jobs.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 250494)
Time Spent: 10m
Remaining Estimate: 0h
> Unbounded HCatalogIO Reader using splittable pardos
> ---------------------------------------------------
>
> Key: BEAM-7450
> URL: https://issues.apache.org/jira/browse/BEAM-7450
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-core
> Reporter: Ankit Jhalaria
> Assignee: Ankit Jhalaria
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> # Current version of HcatalogIO is a bounded source.
> # While migrating our jobs to aws, we realized that it would be helpful to
> have an unbounded hcat reader that can behave as an unbounded source and
> polls for new partitions as and when they become available.
> # I have used splittable pardo(s) to do this. There is a flag that can be
> set to treat this as a bounded source which will terminate if that flag is
> set.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)