[
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Chen resolved BEAM-2815.
--------------------------------
Resolution: Fixed
Fix Version/s: 2.3.0
> Python DirectRunner is unusable with input files in the 100-250MB range
> -----------------------------------------------------------------------
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
> Issue Type: Bug
> Components: runner-direct, sdk-py-core
> Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x
> Reporter: Peter Hausel
> Assignee: Charles Chen
> Priority: Major
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current python DirectRunner implementation seems to be unusable with
> training data sets that are bigger than tiny samples - making serious local
> development impossible or very cumbersome. I am aware of some of the
> limitations of the current DirectRunner implementation[1][2][3], however I
> was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015) and had to terminate the
> process after 10 minutes or so (screenshots about high memory and CPU
> utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
> """Main entry point; defines and runs the pipeline."""
> parser = argparse.ArgumentParser()
> parser.add_argument('--input',
> dest='input',
>
> default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
> help='Input file to process.')
> known_args, pipeline_args = parser.parse_known_args(argv)
> pipeline_options = PipelineOptions(pipeline_args)
> pipeline_options.view_as(SetupOptions).save_main_session = True
> pipeline = beam.Pipeline(options=pipeline_options)
> raw_data = (
> pipeline
> | 'ReadTrainData' >> textio.ReadFromText(known_args.input,
> skip_header_lines=1)
> | 'Map' >> beam.Map(lambda line: line.lower())
> )
> result = pipeline.run()
> result.wait_until_finish()
> print(raw_data)
> if __name__ == '__main__':
> run()
> {code}
> Example dataset:
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison:
> {code}
> lines = [line.lower() for line in
> open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)