[jira] [Updated] (BEAM-9434) Performance improvements processing a large number of Avro files in S3+Spark

Jira Wed, 04 Mar 2020 12:35:22 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ismaël Mejía updated BEAM-9434:
-------------------------------
    Status: Open  (was: Triage Needed)

> Performance improvements processing a large number of Avro files in S3+Spark
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-9434
>                 URL: https://issues.apache.org/jira/browse/BEAM-9434
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-aws, sdk-java-core
>    Affects Versions: 2.19.0
>            Reporter: Emiliano Capoccia
>            Assignee: Emiliano Capoccia
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro 
> files (tens of thousands or more) in Spark on Kubernetes.
> The recommended way of reading a pattern of Avro files in Beam is:
>  
> {code:java}
> // AvroGenClass is the Avro-generated record class; p is the Pipeline.
> PCollection<AvroGenClass> records = p.apply(
>     AvroIO.read(AvroGenClass.class)
>         .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles());
> {code}
> However, with many small files the above results in the entire read 
> taking place in a single task on a single node, which is slow and does 
> not scale.
> Omitting the hint is not viable either, as it results in too many tasks 
> being spawned, leaving the cluster busy coordinating tiny tasks with 
> high overhead.
> The workarounds found online mainly revolve around compacting the input 
> files before processing, so that fewer, larger files are processed in 
> parallel.
>  
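
The compaction workaround described above can be illustrated with a minimal sketch in plain Java, without any Beam or S3 calls. It only plans which small files would be merged together so that each resulting file approaches a target size. The class CompactionPlanner, the InputFile holder, the planBatches method and the target size are hypothetical names introduced for illustration, and file sizes are assumed to be known from a prior listing of the bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CompactionPlanner {

    /** Hypothetical holder for an input file path and its size in bytes. */
    public static final class InputFile {
        public final String path;
        public final long sizeBytes;

        public InputFile(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Greedily groups files, in listing order, into batches whose total size
     * stays at or below targetBytes. A file larger than targetBytes ends up
     * in a batch of its own.
     */
    public static List<List<InputFile>> planBatches(List<InputFile> files, long targetBytes) {
        List<List<InputFile>> batches = new ArrayList<>();
        List<InputFile> current = new ArrayList<>();
        long currentBytes = 0;
        for (InputFile file : files) {
            // Close the current batch if adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + file.sizeBytes > targetBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(file);
            currentBytes += file.sizeBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Illustrative paths and sizes only; not taken from a real bucket.
        List<InputFile> files = Arrays.asList(
            new InputFile("s3://my-bucket/path-to/a.avro", 40),
            new InputFile("s3://my-bucket/path-to/b.avro", 40),
            new InputFile("s3://my-bucket/path-to/c.avro", 40));
        for (List<InputFile> batch : planBatches(files, 100)) {
            System.out.println("batch with " + batch.size() + " file(s)");
        }
    }
}
```

Each planned batch would then be merged into one larger Avro file by a separate job, so that the pipeline reads a few bulky files instead of tens of thousands of small ones. The greedy grouping is a simplification; it does not balance batch sizes.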



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
