As another data point, you could try using --runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner
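
For example, something along these lines (a minimal sketch only; "repro.py" is just a placeholder for the repro script from the issue):

{code}
# Run the same repro script, but route it through the FnApiRunner instead of
# the default DirectRunner (placeholder script name, flag value from above):
#
#   python repro.py \
#       --runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner
#
# Or, equivalently, build the pipeline options directly in code:
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    ['--runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner'])
{code}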
On Thu, Sep 7, 2017 at 1:20 PM, Ahmet Altay (JIRA) <[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157573#comment-16157573 ]
>
> Ahmet Altay commented on BEAM-2815:
> -----------------------------------
>
> Thank you [~pk11] for the offer. [~charleschen] is coordinating the effort
> here, he can reach out to you if there are pieces that can be done in
> parallel.
>
>> Python DirectRunner is unusable with input files in the 100-250MB range
>> -----------------------------------------------------------------------
>>
>>                 Key: BEAM-2815
>>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>>             Project: Beam
>>          Issue Type: Bug
>>          Components: runner-direct, sdk-py
>>    Affects Versions: 2.1.0
>>        Environment: python 2.7.10, beam 2.1, os x
>>            Reporter: Peter Hausel
>>            Assignee: Charles Chen
>>         Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 2017-08-27 at 9.06.00 AM.png
>>
>>
>> The current python DirectRunner implementation seems to be unusable with
>> training data sets that are bigger than tiny samples - making serious local
>> development impossible or very cumbersome. I am aware of some of the
>> limitations of the current DirectRunner implementation[1][2][3], however I
>> was not sure if this odd behavior is expected.
>>
>> [1][2][3]
>> https://stackoverflow.com/a/44765621
>> https://issues.apache.org/jira/browse/BEAM-1442
>> https://beam.apache.org/documentation/runners/direct/
>>
>> Repro:
>> The simple script below blew up my laptop (MBP 2015) and had to terminate
>> the process after 10 minutes or so (screenshots about high memory and CPU
>> utilization are also attached).
>> {code}
>> from apache_beam.io import textio
>> import apache_beam as beam
>> from apache_beam.options.pipeline_options import PipelineOptions
>> from apache_beam.options.pipeline_options import SetupOptions
>> import argparse
>>
>>
>> def run(argv=None):
>>   """Main entry point; defines and runs the pipeline."""
>>   parser = argparse.ArgumentParser()
>>   parser.add_argument('--input',
>>                       dest='input',
>>                       default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>>                       help='Input file to process.')
>>   known_args, pipeline_args = parser.parse_known_args(argv)
>>   pipeline_options = PipelineOptions(pipeline_args)
>>   pipeline_options.view_as(SetupOptions).save_main_session = True
>>   pipeline = beam.Pipeline(options=pipeline_options)
>>
>>   raw_data = (
>>       pipeline
>>       | 'ReadTrainData' >> textio.ReadFromText(known_args.input,
>>                                                skip_header_lines=1)
>>       | 'Map' >> beam.Map(lambda line: line.lower())
>>   )
>>
>>   result = pipeline.run()
>>   result.wait_until_finish()
>>   print(raw_data)
>>
>>
>> if __name__ == '__main__':
>>   run()
>> {code}
>>
>> Example dataset:
>> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
>>
>> for comparison:
>> {code}
>> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
>> print(len(lines))
>> {code}
>> this vanilla python script runs on the same hardware and dataset in 0m4.909s.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
