[ https://issues.apache.org/jira/browse/BEAM-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Strokach updated BEAM-8734:
----------------------------------
Description:
The proposed {{FileBasedCache.write}} method allows the user to write a list of
arbitrary objects to a cache. The {{element_type}} and the appropriate
{{coder}} for the list of objects are inferred using the
{{apache_beam.testing.datatype_inference.infer_element_type}} function. This
works well for lists that are small to moderate in size, but is likely to be
very inefficient when the amount of data being written is large.
Two approaches to solving this issue have been considered:
1. We could attempt to infer the {{element_type}} from the first N elements
(e.g. first 100 elements) in the provided list. This should produce the correct
{{element_type}} for all elements in the list in the majority of cases (since
every element in the list is likely to have the same data type). In the cases
where the inferred {{element_type}} is incorrect, we could catch the resulting
errors and infer the {{element_type}} again using a larger portion of the data.
2. If inferring the {{element_type}} in the first call to
{{FileBasedCache.write}} takes too long, we could instruct the user to split
the write into two calls: a first call with a small but representative sample
of the data, and a second call with the rest of the data. Since the
{{element_type}} is inferred only the first time that anything is written to a
cache, subsequent calls would not have the same constraint on the number of
elements.
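The two approaches can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: {{SketchCache}}, {{_infer}}, {{_encode}} and {{sample_size}} are hypothetical names, the set of Python types stands in for what {{infer_element_type}} would return, and a {{TypeError}} stands in for whatever error a mismatched coder would raise.

```python
class SketchCache:
    """Toy stand-in for FileBasedCache (hypothetical, not the Beam class)."""

    def __init__(self, sample_size=100):
        self.sample_size = sample_size
        self.element_type = None   # inferred on the first write only
        self.records = []
        self.inference_sizes = []  # number of elements scanned per inference

    def _infer(self, elements):
        # Stand-in for infer_element_type: the set of Python types seen.
        self.inference_sizes.append(len(elements))
        return frozenset(type(e) for e in elements)

    def _encode(self, element):
        # Stand-in for a coder: rejects values outside the inferred type.
        if type(element) not in self.element_type:
            raise TypeError(f"{element!r} does not match the inferred type")
        return repr(element)

    def write(self, elements):
        elements = list(elements)
        if not elements:
            return
        if self.element_type is not None:
            # Later writes reuse the stored type; nothing is re-inferred.
            encoded = [self._encode(e) for e in elements]
            self.records.extend(encoded)
            return
        # Approach 1: infer from a prefix, widen the prefix on failure.
        n = self.sample_size
        while True:
            self.element_type = self._infer(elements[:n])
            try:
                encoded = [self._encode(e) for e in elements]
                break
            except TypeError:
                if n >= len(elements):
                    self.element_type = None
                    raise
                n = min(len(elements), n * 10)
        self.records.extend(encoded)


# Approach 1: the first 100 elements miss the trailing string, so the
# inference is repeated on a larger portion of the data.
cache = SketchCache(sample_size=100)
cache.write([1] * 150 + ["a"])

# Approach 2: a small representative first write, then the bulk of the data.
cache2 = SketchCache()
cache2.write([1, 2, 3])
cache2.write(range(1000))
```

In this sketch the retry multiplies the sample size by ten, which is an arbitrary choice; the second cache shows that only the first, small write pays the inference cost.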
was:
The proposed {{FileBasedCache.write}} method allows the user to write a list of
arbitrary objects to a cache. The {{element_type}} and the appropriate
{{coder}} for the list of objects is inferred using the
{{apache_beam.testing.datatype_inference.infer_element_type}} function. This
works well for lists that are small to moderate in size, but is likely to be
very inefficient when the amount of data being written is large.
Two approaches to solving this issue have been considered:
1. We could attempt to infer the {{element_type}} from the first N elements
(e.g. first 100 elements) in the provided list. This should produce the correct
{{element_type}} for all elements in the list in the majority of cases (since
every element in the list is likely to have the same data type). In the cases
where the inferred element_type is incorrect, we could attempt to catch the
resulting errors and infer the {{element_type}} again using a larger portion of
the data.
2. If inferring the `element_type` in the first call to
{{FileBasedCache.write}} takes too long, we could instruct the user to try
again, in the first call providing a small but representative sample of the
data, while in the second call providing the rest of the data. Since the
`element_type` is inferred only the first time that anything is written to a
cache, subsequent calls would not have the same constraint on the number of
elements.
> Optimize the inference of element_type when writing a list of objects to
> FileBasedCache
> ---------------------------------------------------------------------------------------
>
> Key: BEAM-8734
> URL: https://issues.apache.org/jira/browse/BEAM-8734
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Affects Versions: 2.16.0
> Reporter: Alexey Strokach
> Priority: Minor
> Original Estimate: 72h
> Remaining Estimate: 72h