[ 
https://issues.apache.org/jira/browse/BEAM-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Webb updated BEAM-8734:
----------------------------
    Resolution: Fixed
        Status: Resolved  (was: Triage Needed)

The attached link makes it look like this has been resolved.

> Optimize the inference of element_type when writing a list of objects to 
> FileBasedCache
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-8734
>                 URL: https://issues.apache.org/jira/browse/BEAM-8734
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>    Affects Versions: 2.16.0
>            Reporter: Alexey Strokach
>            Priority: P3
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The proposed {{FileBasedCache.write}} method allows the user to write a list 
> of arbitrary objects to a cache. The {{element_type}} and the appropriate 
> {{coder}} for the list of objects are inferred using the 
> {{apache_beam.testing.datatype_inference.infer_element_type}} function. This 
> works well for lists that are small to moderate in size, but is likely to be 
> very inefficient when the amount of data being written is large.
> Two approaches to solving this issue have been considered:
> 1. We could attempt to infer the {{element_type}} from the first N elements 
> (e.g. first 100 elements) in the provided list. This should produce the 
> correct {{element_type}} for all elements in the list in the majority of 
> cases (since every element in the list is likely to have the same data type). 
> In the cases where the inferred {{element_type}} is incorrect, we could attempt 
> to catch the resulting errors and infer the {{element_type}} again using a 
> larger portion of the data.
> 2. If inferring the {{element_type}} in the first call to 
> {{FileBasedCache.write}} takes too long, we could instruct the user to try 
> again, in the first call providing a small but representative sample of the 
> data, while in the second call providing the rest of the data. Since the 
> {{element_type}} is inferred only the first time that anything is written to 
> a cache, subsequent calls would not have the same constraint on the number of 
> elements.
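
For reference, approach 1 from the description could be sketched roughly as
below. This is only an illustration under stated assumptions, not the code
that resolved the issue: infer_types is a toy stand-in for
infer_element_type (it just collects Python types), and strict_write,
write_with_sampled_inference, and the tenfold growth factor are hypothetical
names and choices that do not come from Beam.

```python
from typing import Any, Callable, FrozenSet, Sequence


def infer_types(elements: Sequence[Any]) -> FrozenSet[type]:
    """Toy stand-in for infer_element_type: the set of Python types seen."""
    return frozenset(type(element) for element in elements)


def strict_write(data: Sequence[Any], element_types: FrozenSet[type]) -> None:
    """Hypothetical writer that rejects elements outside the inferred types."""
    for element in data:
        if type(element) not in element_types:
            raise TypeError(f"{element!r} does not match {sorted(map(str, element_types))}")
        # A real cache would encode and persist the element here.


def write_with_sampled_inference(
    data: Sequence[Any],
    write: Callable[[Sequence[Any], FrozenSet[type]], None],
    sample_size: int = 100,
) -> FrozenSet[type]:
    """Infer from the first N elements; on a type error, retry with a larger N."""
    n = min(sample_size, len(data))
    while True:
        element_types = infer_types(data[:n])
        try:
            # Assumption: a failed write leaves nothing behind that needs cleanup.
            write(data, element_types)
            return element_types
        except TypeError:
            if n >= len(data):
                raise  # the whole list was already used for inference
            n = min(len(data), n * 10)  # hypothetical growth factor


if __name__ == "__main__":
    mixed = [1] * 150 + ["a"]
    print(write_with_sampled_inference(mixed, strict_write, sample_size=100))
```

With the mixed list above, the first pass infers only int from the first 100
elements, the write fails on the trailing string, and the second pass infers
from the full list. A real implementation would also have to discard any
partially written output before retrying.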



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
