[jira] [Commented] (BEAM-4379) Make ParquetIO Read splittable

Steve Cosenza (Jira) Wed, 05 Feb 2020 11:26:09 -0800


    [ 
https://issues.apache.org/jira/browse/BEAM-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030946#comment-17030946
 ]


Steve Cosenza commented on BEAM-4379:
-------------------------------------

I'm currently evaluating the scalability and performance of Google Dataflow, 
and we need to be able to split Parquet files when reading them. I created an 
initial POC based on "Splittable DoFn", but I just learned that Dataflow "Does 
not yet support autotuning features of the Source API." [1] Additionally, the 
Beam docs state, "In some cases, implementing a Source might be necessary or 
result in better performance." [2]

Questions:
 * Should I be targeting the BoundedSource API?
 * Will I be able to submit a PR that changes the existing ParquetIO to use a 
BoundedSource?

Thanks,

Steve

 

[1] https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-what

[2] https://beam.apache.org/documentation/io/developing-io-overview/

> Make ParquetIO Read splittable
> ------------------------------
>
>                 Key: BEAM-4379
>                 URL: https://issues.apache.org/jira/browse/BEAM-4379
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-ideas, io-java-parquet
>            Reporter: Lukasz Gajowy
>            Priority: Major
>
> As the title says: the ParquetIO read is currently not splittable, which is 
> not optimal for runners that support splitting.
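
For illustration only, here is a rough sketch of what "splittable" could mean for a Parquet read. A Parquet file is made up of row groups, which are a natural unit for parallel reads, so one plausible approach is to pack consecutive row groups into ranges of roughly a desired bundle size (similar in spirit to the desiredBundleSizeBytes hint that BoundedSource.split receives). The class name RowGroupSplitter, its split method, and the greedy packing strategy are all hypothetical; none of this is taken from ParquetIO or from the POC mentioned in the comment, and it deliberately has no Beam or Parquet dependency.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Beam or Parquet: plans row-group ranges for parallel reads. */
final class RowGroupSplitter {
  private RowGroupSplitter() {}

  /**
   * Greedily packs consecutive row groups into ranges whose total size stays at or
   * below desiredBundleBytes. A single row group larger than the target still gets
   * its own range, because a row group is treated as the smallest unit of work here.
   *
   * @param rowGroupBytes size in bytes of each row group, in file order (assumed input)
   * @param desiredBundleBytes target size of each range, must be positive
   * @return list of {startIndex, endIndexExclusive} pairs covering all row groups
   */
  static List<int[]> split(long[] rowGroupBytes, long desiredBundleBytes) {
    if (desiredBundleBytes <= 0) {
      throw new IllegalArgumentException("desiredBundleBytes must be positive");
    }
    List<int[]> splits = new ArrayList<>();
    int start = 0;      // index of the first row group in the range being built
    long current = 0;   // bytes accumulated in the range being built
    for (int i = 0; i < rowGroupBytes.length; i++) {
      // Close the current range if it is non-empty and adding this row group would overshoot.
      if (i > start && current + rowGroupBytes[i] > desiredBundleBytes) {
        splits.add(new int[] {start, i});
        start = i;
        current = 0;
      }
      current += rowGroupBytes[i];
    }
    // Emit the trailing range, if any row groups remain.
    if (start < rowGroupBytes.length) {
      splits.add(new int[] {start, rowGroupBytes.length});
    }
    return splits;
  }
}
```

Each returned range could then back one sub-source (or one restriction in a Splittable DoFn), so that a runner can read the ranges independently. A real implementation would take the row-group sizes from the Parquet footer metadata rather than from a plain array.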



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
