Alexey Strokach created BEAM-8734:
-------------------------------------

             Summary: Optimize the inference of element_type when writing a 
list of objects to FileBasedCache
                 Key: BEAM-8734
                 URL: https://issues.apache.org/jira/browse/BEAM-8734
             Project: Beam
          Issue Type: Improvement
          Components: sdk-py-core
    Affects Versions: 2.16.0
            Reporter: Alexey Strokach


The proposed {{FileBasedCache.write}} method allows the user to write a list of 
arbitrary objects to a file. The {{element_type}} and the appropriate {{coder}} 
for the list of objects are inferred using the 
{{apache_beam.testing.datatype_inference.infer_element_type}} function. This 
works well for lists that are small to moderate in size, but is likely to be 
very inefficient when a large amount of data is being written.

Two approaches to solving this issue have been considered:

1. We could attempt to infer the {{element_type}} from the first N elements 
(e.g. the first 100 elements) in the list. This should produce the correct 
{{element_type}} in the majority of cases. In the cases where the inferred 
{{element_type}} is incorrect, we could attempt to catch the resulting errors 
and infer the {{element_type}} again using a larger portion of the data.

2. If inferring the {{element_type}} in the first call to 
{{FileBasedCache.write}} takes too long, we could instruct the user to split 
the write into two calls: the first providing a small but representative 
sample of the data, and the second providing the rest. Since the 
{{element_type}} is inferred only the first time that anything is written to a 
cache, subsequent calls would not have the same constraint on the number of 
elements.
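
The two approaches above can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the proposed implementation: 
{{infer_types}}, {{encode_all}}, {{TypeMismatch}}, 
{{write_with_sampled_inference}} and {{SketchCache}} are hypothetical names, 
a set of Python types stands in for the real {{element_type}}, and {{repr}} 
stands in for a real coder.

```python
from typing import Any, FrozenSet, List, Optional, Sequence, Tuple


class TypeMismatch(TypeError):
    """Hypothetical error raised when an element does not fit the inferred type."""


def infer_types(elements: Sequence[Any]) -> FrozenSet[type]:
    # Hypothetical stand-in for infer_element_type: the set of Python types seen.
    return frozenset(type(e) for e in elements)


def encode_all(elements: Sequence[Any], allowed: FrozenSet[type]) -> List[str]:
    # Stand-in for encoding with a coder: fails on the first unexpected type.
    encoded = []
    for e in elements:
        if type(e) not in allowed:
            raise TypeMismatch(f"{type(e).__name__} not in inferred element type")
        encoded.append(repr(e))
    return encoded


def write_with_sampled_inference(
    elements: Sequence[Any], sample_size: int = 100
) -> Tuple[FrozenSet[type], List[str]]:
    """Approach 1: infer from the first N elements, retry with more on failure."""
    n = min(sample_size, len(elements))
    while True:
        allowed = infer_types(elements[:n])
        try:
            return allowed, encode_all(elements, allowed)
        except TypeMismatch:
            if n >= len(elements):
                raise  # defensive: the whole list was already sampled
            n = min(2 * n, len(elements))  # double the sample and infer again


class SketchCache:
    """Approach 2: the element type is fixed by the first write only."""

    def __init__(self) -> None:
        self.element_types: Optional[FrozenSet[type]] = None
        self.records: List[str] = []

    def write(self, elements: Sequence[Any]) -> None:
        if self.element_types is None:
            # Only the first call pays the inference cost.
            self.element_types = infer_types(elements)
        self.records.extend(encode_all(elements, self.element_types))
```

In this sketch, a user following approach 2 would first call {{write}} with a 
small representative sample and then call it again with the remaining data. A 
later element that does not match the type inferred from the sample surfaces 
as an error rather than being silently accepted.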



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
