As another data point, you could try using --runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner
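
For example, something along these lines (a minimal sketch only; "repro.py" is just a placeholder for the repro script from the issue):

{code}
# Run the same repro script, but route it through the FnApiRunner instead of
# the default DirectRunner (placeholder script name, flag value from above):
#
#   python repro.py \
#       --runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner
#
# Or, equivalently, build the pipeline options directly in code:
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    ['--runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner'])
{code}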
On Thu, Sep 7, 2017 at 1:20 PM, Ahmet Altay (JIRA) <[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157573#comment-16157573 ]
>
> Ahmet Altay commented on BEAM-2815:
> -----------------------------------
>
> Thank you [~pk11] for the offer. [~charleschen] is coordinating the effort
> here, he can reach out to you if there are pieces that can be done in
> parallel.
>
>> Python DirectRunner is unusable with input files in the 100-250MB range
>> -----------------------------------------------------------------------
>>
>>                 Key: BEAM-2815
>>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>>             Project: Beam
>>          Issue Type: Bug
>>          Components: runner-direct, sdk-py
>>    Affects Versions: 2.1.0
>>        Environment: python 2.7.10, beam 2.1, os x
>>            Reporter: Peter Hausel
>>            Assignee: Charles Chen
>>         Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 2017-08-27 at 9.06.00 AM.png
>>
>>
>> The current python DirectRunner implementation seems to be unusable with
>> training data sets that are bigger than tiny samples - making serious local
>> development impossible or very cumbersome. I am aware of some of the
>> limitations of the current DirectRunner implementation[1][2][3], however I
>> was not sure if this odd behavior is expected.
>>
>> [1][2][3]
>> https://stackoverflow.com/a/44765621
>> https://issues.apache.org/jira/browse/BEAM-1442
>> https://beam.apache.org/documentation/runners/direct/
>>
>> Repro:
>> The simple script below blew up my laptop (MBP 2015) and had to terminate
>> the process after 10 minutes or so (screenshots about high memory and CPU
>> utilization are also attached).
>> {code}
>> from apache_beam.io import textio
>> import apache_beam as beam
>> from apache_beam.options.pipeline_options import PipelineOptions
>> from apache_beam.options.pipeline_options import SetupOptions
>> import argparse
>>
>>
>> def run(argv=None):
>>   """Main entry point; defines and runs the pipeline."""
>>   parser = argparse.ArgumentParser()
>>   parser.add_argument('--input',
>>                       dest='input',
>>                       default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>>                       help='Input file to process.')
>>   known_args, pipeline_args = parser.parse_known_args(argv)
>>   pipeline_options = PipelineOptions(pipeline_args)
>>   pipeline_options.view_as(SetupOptions).save_main_session = True
>>   pipeline = beam.Pipeline(options=pipeline_options)
>>
>>   raw_data = (
>>       pipeline
>>       | 'ReadTrainData' >> textio.ReadFromText(known_args.input,
>>                                                skip_header_lines=1)
>>       | 'Map' >> beam.Map(lambda line: line.lower())
>>   )
>>
>>   result = pipeline.run()
>>   result.wait_until_finish()
>>   print(raw_data)
>>
>>
>> if __name__ == '__main__':
>>   run()
>> {code}
>>
>> Example dataset:
>> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
>>
>> for comparison:
>> {code}
>> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
>> print(len(lines))
>> {code}
>> this vanilla python script runs on the same hardware and dataset in 0m4.909s.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
