Alexey Strokach created BEAM-8734:
-------------------------------------
Summary: Optimize the inference of element_type when writing a
list of objects to FileBasedCache
Key: BEAM-8734
URL: https://issues.apache.org/jira/browse/BEAM-8734
Project: Beam
Issue Type: Improvement
Components: sdk-py-core
Affects Versions: 2.16.0
Reporter: Alexey Strokach
The proposed {{FileBasedCache.write}} method allows the user to write a list of
arbitrary objects to a file. The {{element_type}} and the appropriate {{coder}}
for the list of objects are inferred using the
{{apache_beam.testing.datatype_inference.infer_element_type}} function. This
works well for lists that are small to moderate in size, but is likely to be
very inefficient when a large amount of data is being written.
Two approaches to solving this issue have been considered:
1. We could attempt to infer the {{element_type}} from the first N elements
(e.g. first 100 elements) in the list. This should produce the correct
{{element_type}} in the majority of cases. In the cases where the inferred
{{element_type}} is incorrect, we could attempt to catch the resulting errors
and infer the {{element_type}} again using a larger portion of the data.
2. If inferring the {{element_type}} in the first call to
{{FileBasedCache.write}} takes too long, we could instruct the user to split
the write into two calls: the first call provides a small but representative
sample of the data, and the second call provides the rest of the data. Since
the {{element_type}} is inferred only the first time that anything is written
to a cache, subsequent calls would not have the same constraint on the number
of elements.
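The two approaches above can be sketched together as follows. This is only a
rough illustration under stated assumptions, not the proposed implementation:
{{infer_types}}, {{TypeMismatch}} and {{SketchCache}} are hypothetical
stand-ins (plain Python, no Beam imports), the "element type" is modelled as
the set of Python types seen, and the doubling of the sample on failure is one
possible choice for "a larger portion of the data".

```python
from typing import Any, FrozenSet, List, Optional, Sequence


def infer_types(elements: Sequence[Any]) -> FrozenSet[type]:
    """Hypothetical stand-in for infer_element_type: set of Python types seen."""
    return frozenset(type(e) for e in elements)


class TypeMismatch(TypeError):
    """Hypothetical error: an element does not fit the inferred element type."""


class SketchCache:
    """Toy in-memory cache used for illustration; not FileBasedCache."""

    def __init__(self) -> None:
        self.element_types: Optional[FrozenSet[type]] = None
        self.rows: List[Any] = []

    @staticmethod
    def _encode(elements: Sequence[Any], types: FrozenSet[type]) -> List[Any]:
        # Stand-in for encoding with a coder: reject elements of unknown type.
        encoded = []
        for e in elements:
            if type(e) not in types:
                raise TypeMismatch(f"unexpected element type: {type(e).__name__}")
            encoded.append(e)
        return encoded

    def write(self, elements: Sequence[Any], sample_size: int = 100) -> None:
        if sample_size < 1:
            raise ValueError("sample_size must be at least 1")
        if not elements:
            return
        if self.element_types is not None:
            # Approach 2: the type was fixed by the first write, so later
            # writes skip inference regardless of how many elements they carry.
            self.rows.extend(self._encode(elements, self.element_types))
            return
        # Approach 1: infer from the first n elements, and on failure retry
        # with a larger prefix (doubling here is an assumption of this sketch).
        n = min(sample_size, len(elements))
        while True:
            types = infer_types(elements[:n])
            try:
                encoded = self._encode(elements, types)
            except TypeMismatch:
                if n >= len(elements):
                    raise  # the whole list was already used for inference
                n = min(len(elements), n * 2)
                continue
            self.element_types = types
            self.rows.extend(encoded)
            return
```

In this sketch a later write that contains a type not seen during the first
write raises {{TypeMismatch}}, which is why approach 2 depends on the first
sample being representative.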