[
https://issues.apache.org/jira/browse/BEAM-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Webb updated BEAM-8734:
----------------------------
Resolution: Fixed
Status: Resolved (was: Triage Needed)
The attached link makes it look like this has been resolved.
> Optimize the inference of element_type when writing a list of objects to
> FileBasedCache
> ---------------------------------------------------------------------------------------
>
> Key: BEAM-8734
> URL: https://issues.apache.org/jira/browse/BEAM-8734
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Affects Versions: 2.16.0
> Reporter: Alexey Strokach
> Priority: P3
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> The proposed {{FileBasedCache.write}} method allows the user to write a list
> of arbitrary objects to a cache. The {{element_type}} and the appropriate
> {{coder}} for the list of objects are inferred using the
> {{apache_beam.testing.datatype_inference.infer_element_type}} function. This
> works well for lists that are small to moderate in size, but is likely to be
> very inefficient when the amount of data being written is large.
> Two approaches to solving this issue have been considered:
> 1. We could attempt to infer the {{element_type}} from the first N elements
> (e.g. first 100 elements) in the provided list. This should produce the
> correct {{element_type}} for all elements in the list in the majority of
> cases (since every element in the list is likely to have the same data type).
> In the cases where the inferred {{element_type}} is incorrect, we could attempt
> to catch the resulting errors and infer the {{element_type}} again using a
> larger portion of the data.
> 2. If inferring the {{element_type}} in the first call to
> {{FileBasedCache.write}} takes too long, we could instruct the user to split
> the write into two calls: a first call with a small but representative sample
> of the data, and a second call with the rest of the data. Since the
> {{element_type}} is inferred only the first time that anything is written to
> a cache, subsequent calls would not have the same constraint on the number of
> elements.
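
A rough sketch of approach 1 may help make the retry logic concrete. It is not Beam code: {{infer_type}} and {{encode}} are hypothetical stand-ins for {{infer_element_type}} and a coder, and the tenfold growth of the sample on failure is an assumption rather than anything proposed in the issue.

```python
from itertools import islice
from typing import Any, Iterable, List, Sequence, Tuple


def infer_type(elements: Iterable[Any]) -> type:
    """Hypothetical stand-in for infer_element_type: scans every element given."""
    types = {type(e) for e in elements}
    if len(types) == 1:
        return types.pop()
    return object  # most general type for empty or mixed input


def encode(element: Any, element_type: type) -> bytes:
    """Hypothetical stand-in for a coder: rejects elements of the wrong type."""
    if element_type is not object and type(element) is not element_type:
        raise TypeError(
            f"expected {element_type.__name__}, got {type(element).__name__}"
        )
    return repr(element).encode("utf-8")


def write_with_sampled_inference(
    elements: Sequence[Any], sample_size: int = 100
) -> Tuple[type, List[bytes]]:
    """Infer the type from a prefix of the data, widening the prefix on failure."""
    n = max(1, sample_size)
    while True:
        # Only the first n elements are inspected during inference.
        element_type = infer_type(islice(elements, n))
        try:
            encoded = [encode(e, element_type) for e in elements]
            return element_type, encoded
        except TypeError:
            if n >= len(elements):
                raise  # the whole list was already inspected; nothing to widen
            n = min(len(elements), n * 10)  # assumed growth factor
```

With a homogeneous list the first pass succeeds after inspecting only {{sample_size}} elements. A list whose odd element sits beyond the sample triggers one retry over a larger prefix, so the worst case costs more than a single full scan.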