[
https://issues.apache.org/jira/browse/BEAM-13016?focusedWorklogId=678717&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-678717
]
ASF GitHub Bot logged work on BEAM-13016:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Nov/21 20:16
Start Date: 08/Nov/21 20:16
Worklog Time Spent: 10m
Work Description: tvalentyn commented on a change in pull request #15900:
URL: https://github.com/apache/beam/pull/15900#discussion_r745051997
##########
File path: sdks/python/apache_beam/io/avroio.py
##########
@@ -169,8 +159,8 @@ class ReadAllFromAvro(PTransform):
def __init__(
self,
min_bundle_size=0,
- desired_bundle_size=DEFAULT_DESIRED_BUNDLE_SIZE,
use_fastavro=True,
Review comment:
let's keep the order of the arguments
##########
File path: sdks/python/apache_beam/io/avroio.py
##########
@@ -297,130 +217,15 @@ def advance_file_past_next_sync_marker(f, sync_marker):
def _create_avro_source(
- file_pattern=None, min_bundle_size=0, validate=False, use_fastavro=True):
+ file_pattern=None, min_bundle_size=0, validate=False, use_fasvro=True):
Review comment:
typo in the argument name
##########
File path: sdks/python/apache_beam/examples/fastavro_it_test.py
##########
@@ -151,13 +141,7 @@ def batch_indices(start):
fastavro_records = \
fastavro_read_pipeline \
| 'create-fastavro' >> Create(['%s*' % fastavro_output]) \
- | 'read-fastavro' >> ReadAllFromAvro(use_fastavro=True) \
- | Map(lambda rec: (rec['number'], rec))
-
- avro_records = \
- fastavro_read_pipeline \
- | 'create-avro' >> Create(['%s*' % avro_output]) \
- | 'read-avro' >> ReadAllFromAvro(use_fastavro=False) \
+ | 'read-fastavro' >> ReadAllFromAvro() \
Review comment:
The mechanics of this test was to run a pipeline with Avro and with
FastAvro, and then check that the output was the same. Now we don't have Avro
implementation, so such test scenario wouldn't work. We could instead verify
that the checksum of the values that have bean read matches some pre-computed
value.
##########
File path: sdks/python/apache_beam/examples/fastavro_it_test.py
##########
@@ -124,22 +123,13 @@ def batch_indices(start):
| 'create-records' >> Map(record)
fastavro_output = '/'.join([self.output, 'fastavro'])
- avro_output = '/'.join([self.output, 'avro'])
+ # avro_output = '/'.join([self.output, 'avro'])
Review comment:
remove the commented out code
##########
File path: sdks/python/apache_beam/io/avroio.py
##########
@@ -297,130 +217,15 @@ def advance_file_past_next_sync_marker(f, sync_marker):
def _create_avro_source(
- file_pattern=None, min_bundle_size=0, validate=False, use_fastavro=True):
+ file_pattern=None, min_bundle_size=0, validate=False, use_fasvro=True):
Review comment:
Actually, this is an internal function (not a public API), we don't need
to add use_fastavro here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 678717)
Time Spent: 17h (was: 16h 50m)
> Remove avro-python3 dependency from Beam
> ----------------------------------------
>
> Key: BEAM-13016
> URL: https://issues.apache.org/jira/browse/BEAM-13016
> Project: Beam
> Issue Type: Bug
> Components: io-py-avro
> Reporter: Valentyn Tymofieiev
> Assignee: Ananda Prasad Inguva
> Priority: P2
> Time Spent: 17h
> Remaining Estimate: 0h
>
> Beam has been using fastavro as default dependency for Avro IO on Python3 and
> no issues have been reported. Fastavro has more frequent release cycle and is
> faster.
> Let's remove the dependency on avro-python3 to avoid branching in our
> codebase so that we don't replicate it in other parts of Beam, which was
> recently happening on:
> https://github.com/apache/beam/pull/15185#pullrequestreview-729352200
> https://github.com/apache/beam/pull/15185#pullrequestreview-731157837
--
This message was sent by Atlassian Jira
(v8.20.1#820001)