[jira] [Updated] (BEAM-8734) Optimize the inference of element_type when writing a list of objects to FileBasedCache

Alexey Strokach (Jira) Mon, 18 Nov 2019 10:18:14 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexey Strokach updated BEAM-8734:
----------------------------------
    Description: 
The proposed {{FileBasedCache.write}} method lets the user write a list of 
arbitrary objects to a cache. The {{element_type}} and the appropriate 
{{coder}} for the list of objects are inferred using the 
{{apache_beam.testing.datatype_inference.infer_element_type}} function. This 
works well for lists that are small to moderate in size, but is likely to be 
very inefficient when the amount of data being written is large.

Two approaches to solving this issue have been considered:

1. We could attempt to infer the {{element_type}} from the first N elements 
(e.g. the first 100 elements) of the provided list. This should produce the 
correct {{element_type}} for all elements in the list in the majority of cases 
(since every element in the list is likely to have the same data type). In the 
cases where the inferred {{element_type}} is incorrect, we could attempt to 
catch the resulting errors and infer the {{element_type}} again using a larger 
portion of the data.

2. If inferring the {{element_type}} in the first call to 
{{FileBasedCache.write}} takes too long, we could instruct the user to split 
the write into two calls: a first call that provides a small but 
representative sample of the data, and a second call that provides the rest of 
the data. Since the {{element_type}} is inferred only the first time that 
anything is written to a cache, subsequent calls would not have the same 
constraint on the number of elements.
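
The two approaches above can be sketched together in a few lines. This is only 
a rough illustration under stated assumptions, not the proposed 
implementation: {{SketchCache}}, {{infer_types}} and {{sample_size}} are 
hypothetical names, the cache is a plain in-memory list, and a simple type 
check stands in for the error that a real coder would raise on an unexpected 
element.

```python
from typing import Any, FrozenSet, List, Optional, Sequence


def infer_types(elements: Sequence[Any]) -> FrozenSet[type]:
    # Hypothetical stand-in for infer_element_type: the set of concrete
    # Python types observed in the given elements.
    return frozenset(type(e) for e in elements)


class SketchCache:
    """Toy in-memory cache used for illustration; not FileBasedCache."""

    def __init__(self, sample_size: int = 100) -> None:
        self.sample_size = sample_size
        self.element_types: Optional[FrozenSet[type]] = None
        self.rows: List[Any] = []

    @staticmethod
    def _check(elements: Sequence[Any], types: FrozenSet[type]) -> None:
        # Stands in for the encoding error a real coder would raise.
        for e in elements:
            if type(e) not in types:
                raise TypeError(f"unexpected element type: {type(e).__name__}")

    def write(self, elements: Sequence[Any]) -> None:
        if not elements:
            return
        if self.element_types is not None:
            # Approach 2: the type was fixed by the first write, so later
            # (possibly much larger) writes are validated but never re-inferred.
            self._check(elements, self.element_types)
            self.rows.extend(elements)
            return
        # Approach 1: infer from the first n elements and widen the sample
        # whenever the rest of the data does not match the inferred type.
        n = min(self.sample_size, len(elements))
        while True:
            types = infer_types(elements[:n])
            try:
                self._check(elements, types)
            except TypeError:
                n = min(len(elements), max(1, n * 10))
                continue
            break
        self.element_types = types
        self.rows.extend(elements)
```

With {{sample_size=2}}, writing {{[1, 2, 3, "x"]}} first infers only 
{{int}}, hits the mismatch on {{"x"}}, and retries with the whole list. A 
later write of {{[4, "y"]}} reuses the stored types without inferring again.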

  was:
The proposed {{FileBasedCache.write}} method allows the user to write a list of 
arbitrary objects to a cache. The {{element_type}} and the appropriate 
{{coder}} for the list of objects is inferred using the 
{{apache_beam.testing.datatype_inference.infer_element_type}} function. This 
works well for lists that are small to moderate in size, but is likely to be 
very inefficient when the amount of data being written is large.

Two approaches to solving this issue have been considered:

1. We could attempt to infer the {{element_type}} from the first N elements 
(e.g. first 100 elements) in the provided list. This should produce the correct 
{{element_type}} for all elements in the list in the majority of cases (since 
every element in the list is likely to have the same data type). In  the cases 
where the inferred element_type is incorrect, we could attempt to catch the 
resulting errors and infer the {{element_type}} again using a larger portion of 
the data.

2. If inferring the `element_type` in the first call to 
{{FileBasedCache.write}} takes too long, we could instruct the user to try 
again, in the first call providing a small but representative sample of the 
data, while in the second call providing the rest of the data. Since the 
`element_type` is inferred only the first time that anything is written to a 
cache, subsequent calls would not have the same constraint on the number of 
elements.


> Optimize the inference of element_type when writing a list of objects to 
> FileBasedCache
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-8734
>                 URL: https://issues.apache.org/jira/browse/BEAM-8734
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>    Affects Versions: 2.16.0
>            Reporter: Alexey Strokach
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>



