I'm using a managed notebook instance from GCP. Those seem to come with cloudpickle==1.2.2 as soon as you provision them, and apache-beam[gcp] then installs dill==0.3.1.1. I'm going to try uninstalling cloudpickle before installing apache-beam and see if that fixes the problem.
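[Editor's note: a minimal sketch, not from the thread, for checking which pickling-library versions are present in the submitting environment before launching a job; a local/worker mismatch of dill or cloudpickle is the suspect discussed below.]

```python
import importlib

# Collect the versions of the two pickling libraries mentioned in this
# thread. A None entry means the package is not installed at all.
versions = {}
for name in ("dill", "cloudpickle"):
    try:
        versions[name] = importlib.import_module(name).__version__
    except ImportError:
        versions[name] = None

print(versions)
```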
Thank you

On Tue, Feb 4, 2020 at 11:54 AM Valentyn Tymofieiev <valen...@google.com>
wrote:

> The fact that you have cloudpickle==1.2.2 further confirms that you may
> be hitting the same error as
> https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
>
> Could you try to start over with a clean virtual environment?
>
> On Tue, Feb 4, 2020 at 11:46 AM Alan Krumholz <alan.krumh...@betterup.co>
> wrote:
>
>> Hi Valentyn,
>>
>> Here is my pip freeze on my machine (note that the error is in Dataflow;
>> the job runs fine on my machine):
>>
>> ansiwrap==0.8.4
>> apache-beam==2.19.0
>> arrow==0.15.5
>> asn1crypto==1.3.0
>> astroid==2.3.3
>> astropy==3.2.3
>> attrs==19.3.0
>> avro-python3==1.9.1
>> azure-common==1.1.24
>> azure-storage-blob==2.1.0
>> azure-storage-common==2.1.0
>> backcall==0.1.0
>> bcolz==1.2.1
>> binaryornot==0.4.4
>> bleach==3.1.0
>> boto3==1.11.9
>> botocore==1.14.9
>> cachetools==3.1.1
>> certifi==2019.11.28
>> cffi==1.13.2
>> chardet==3.0.4
>> Click==7.0
>> cloudpickle==1.2.2
>> colorama==0.4.3
>> configparser==4.0.2
>> confuse==1.0.0
>> cookiecutter==1.7.0
>> crcmod==1.7
>> cryptography==2.8
>> cycler==0.10.0
>> daal==2019.0
>> datalab==1.1.5
>> decorator==4.4.1
>> defusedxml==0.6.0
>> dill==0.3.1.1
>> distro==1.0.1
>> docker==4.1.0
>> docopt==0.6.2
>> docutils==0.15.2
>> entrypoints==0.3
>> enum34==1.1.6
>> fairing==0.5.3
>> fastavro==0.21.24
>> fasteners==0.15
>> fsspec==0.6.2
>> future==0.18.2
>> gcsfs==0.6.0
>> gitdb2==2.0.6
>> GitPython==3.0.5
>> google-api-core==1.16.0
>> google-api-python-client==1.7.11
>> google-apitools==0.5.28
>> google-auth==1.11.0
>> google-auth-httplib2==0.0.3
>> google-auth-oauthlib==0.4.1
>> google-cloud-bigquery==1.17.1
>> google-cloud-bigtable==1.0.0
>> google-cloud-core==1.2.0
>> google-cloud-dataproc==0.6.1
>> google-cloud-datastore==1.7.4
>> google-cloud-language==1.3.0
>> google-cloud-logging==1.14.0
>> google-cloud-monitoring==0.31.1
>> google-cloud-pubsub==1.0.2
>> google-cloud-secret-manager==0.1.1
>> google-cloud-spanner==1.13.0
>> google-cloud-storage==1.25.0
>> google-cloud-translate==2.0.0
>> google-compute-engine==20191210.0
>> google-resumable-media==0.4.1
>> googleapis-common-protos==1.51.0
>> grpc-google-iam-v1==0.12.3
>> grpcio==1.26.0
>> h5py==2.10.0
>> hdfs==2.5.8
>> html5lib==1.0.1
>> htmlmin==0.1.12
>> httplib2==0.12.0
>> icc-rt==2020.0.133
>> idna==2.8
>> ijson==2.6.1
>> imageio==2.6.1
>> importlib-metadata==1.4.0
>> intel-numpy==1.15.1
>> intel-openmp==2020.0.133
>> intel-scikit-learn==0.19.2
>> intel-scipy==1.1.0
>> ipykernel==5.1.4
>> ipython==7.9.0
>> ipython-genutils==0.2.0
>> ipython-sql==0.3.9
>> ipywidgets==7.5.1
>> isort==4.3.21
>> jedi==0.16.0
>> Jinja2==2.11.0
>> jinja2-time==0.2.0
>> jmespath==0.9.4
>> joblib==0.14.1
>> json5==0.8.5
>> jsonschema==3.2.0
>> jupyter==1.0.0
>> jupyter-aihub-deploy-extension==0.1
>> jupyter-client==5.3.4
>> jupyter-console==6.1.0
>> jupyter-contrib-core==0.3.3
>> jupyter-contrib-nbextensions==0.5.1
>> jupyter-core==4.6.1
>> jupyter-highlight-selected-word==0.2.0
>> jupyter-http-over-ws==0.0.7
>> jupyter-latex-envs==1.4.6
>> jupyter-nbextensions-configurator==0.4.1
>> jupyterlab==1.2.6
>> jupyterlab-git==0.9.0
>> jupyterlab-server==1.0.6
>> keyring==10.1
>> keyrings.alt==1.3
>> kiwisolver==1.1.0
>> kubernetes==10.0.1
>> lazy-object-proxy==1.4.3
>> llvmlite==0.31.0
>> lxml==4.4.2
>> Markdown==3.1.1
>> MarkupSafe==1.1.1
>> matplotlib==3.0.3
>> mccabe==0.6.1
>> missingno==0.4.2
>> mistune==0.8.4
>> mkl==2019.0
>> mkl-fft==1.0.6
>> mkl-random==1.0.1.1
>> mock==2.0.0
>> monotonic==1.5
>> more-itertools==8.1.0
>> nbconvert==5.6.1
>> nbdime==1.1.0
>> nbformat==5.0.4
>> networkx==2.4
>> nltk==3.4.5
>> notebook==6.0.3
>> numba==0.47.0
>> numpy==1.15.1
>> oauth2client==3.0.0
>> oauthlib==3.1.0
>> opencv-python==4.1.2.30
>> oscrypto==1.2.0
>> packaging==20.1
>> pandas==0.25.3
>> pandas-profiling==1.4.0
>> pandocfilters==1.4.2
>> papermill==1.2.1
>> parso==0.6.0
>> pathlib2==2.3.5
>> pbr==5.4.4
>> pexpect==4.8.0
>> phik==0.9.8
>> pickleshare==0.7.5
>> Pillow-SIMD==6.2.2.post1
>> pipdeptree==0.13.2
>> plotly==4.5.0
>> pluggy==0.13.1
>> poyo==0.5.0
>> prettytable==0.7.2
>> prometheus-client==0.7.1
>> prompt-toolkit==2.0.10
>> protobuf==3.11.2
>> psutil==5.6.7
>> ptyprocess==0.6.0
>> py==1.8.1
>> pyarrow==0.15.1
>> pyasn1==0.4.8
>> pyasn1-modules==0.2.8
>> pycparser==2.19
>> pycrypto==2.6.1
>> pycryptodomex==3.9.6
>> pycurl==7.43.0
>> pydaal==2019.0.0.20180713
>> pydot==1.4.1
>> Pygments==2.5.2
>> pygobject==3.22.0
>> PyJWT==1.7.1
>> pylint==2.4.4
>> pymongo==3.10.1
>> pyOpenSSL==19.1.0
>> pyparsing==2.4.6
>> pyrsistent==0.15.7
>> pytest==5.3.4
>> pytest-pylint==0.14.1
>> python-apt==1.4.1
>> python-dateutil==2.8.1
>> pytz==2019.3
>> PyWavelets==1.1.1
>> pyxdg==0.25
>> PyYAML==5.3
>> pyzmq==18.1.1
>> qtconsole==4.6.0
>> requests==2.22.0
>> requests-oauthlib==1.3.0
>> retrying==1.3.3
>> rsa==4.0
>> s3transfer==0.3.2
>> scikit-image==0.15.0
>> scikit-learn==0.19.2
>> scipy==1.1.0
>> seaborn==0.9.1
>> SecretStorage==2.3.1
>> Send2Trash==1.5.0
>> simplegeneric==0.8.1
>> six==1.14.0
>> smmap2==2.0.5
>> snowflake-connector-python==2.2.0
>> SQLAlchemy==1.3.13
>> sqlparse==0.3.0
>> tbb==2019.0
>> tbb4py==2019.0
>> tenacity==6.0.0
>> terminado==0.8.3
>> testpath==0.4.4
>> textwrap3==0.9.2
>> tornado==5.1.1
>> tqdm==4.42.0
>> traitlets==4.3.3
>> typed-ast==1.4.1
>> typing==3.7.4.1
>> typing-extensions==3.7.4.1
>> unattended-upgrades==0.1
>> uritemplate==3.0.1
>> urllib3==1.24.2
>> virtualenv==16.7.9
>> wcwidth==0.1.8
>> webencodings==0.5.1
>> websocket-client==0.57.0
>> Werkzeug==0.16.1
>> whichcraft==0.6.1
>> widgetsnbextension==3.5.1
>> wrapt==1.11.2
>> zipp==1.1.0
>>
>> On Tue, Feb 4, 2020 at 11:33 AM Valentyn Tymofieiev <valen...@google.com>
>> wrote:
>>
>>> I don't think there is a mismatch between dill versions here, but
>>> https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
>>> mentions a similar error and may be related. What is the output of pip
>>> freeze on your machine (or better: pip install pipdeptree; pipdeptree)?
>>>
>>> On Tue, Feb 4, 2020 at 11:22 AM Alan Krumholz <alan.krumh...@betterup.co>
>>> wrote:
>>>
>>>> Here is a test job that sometimes fails and sometimes doesn't (but most
>>>> of the time it does). There seems to be something stochastic causing
>>>> this, as after several tests a couple of them did succeed.
>>>>
>>>> def test_error(
>>>>         bq_table: str) -> str:
>>>>
>>>>     import apache_beam as beam
>>>>     from apache_beam.options.pipeline_options import PipelineOptions
>>>>
>>>>     class GenData(beam.DoFn):
>>>>         def process(self, _):
>>>>             for _ in range(20000):
>>>>                 yield {'a': 1, 'b': 2}
>>>>
>>>>     def get_bigquery_schema():
>>>>         from apache_beam.io.gcp.internal.clients import bigquery
>>>>
>>>>         table_schema = bigquery.TableSchema()
>>>>         columns = [
>>>>             ["a", "integer", "nullable"],
>>>>             ["b", "integer", "nullable"]
>>>>         ]
>>>>
>>>>         for column in columns:
>>>>             column_schema = bigquery.TableFieldSchema()
>>>>             column_schema.name = column[0]
>>>>             column_schema.type = column[1]
>>>>             column_schema.mode = column[2]
>>>>             table_schema.fields.append(column_schema)
>>>>
>>>>         return table_schema
>>>>
>>>>     pipeline = beam.Pipeline(options=PipelineOptions(
>>>>         project='my-project',
>>>>         temp_location='gs://my-bucket/temp',
>>>>         staging_location='gs://my-bucket/staging',
>>>>         runner='DataflowRunner'
>>>>     ))
>>>>     # pipeline = beam.Pipeline()
>>>>
>>>>     (
>>>>         pipeline
>>>>         | 'Empty start' >> beam.Create([''])
>>>>         | 'Generate Data' >> beam.ParDo(GenData())
>>>>         # | 'print' >> beam.Map(print)
>>>>         | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
>>>>             project=bq_table.split(':')[0],
>>>>             dataset=bq_table.split(':')[1].split('.')[0],
>>>>             table=bq_table.split(':')[1].split('.')[1],
>>>>             schema=get_bigquery_schema(),
>>>>             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
>>>>             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
>>>>     )
>>>>
>>>>     result = pipeline.run()
>>>>     result.wait_until_finish()
>>>>
>>>>     return True
>>>>
>>>> test_error(
>>>>     bq_table='my-project:my_dataset.my_table'
>>>> )
>>>>
>>>> On Tue, Feb 4, 2020 at 10:04 AM Alan Krumholz <alan.krumh...@betterup.co>
>>>> wrote:
>>>>
>>>>> I tried breaking apart my pipeline. Seems the step that breaks it is:
>>>>> beam.io.WriteToBigQuery
>>>>>
>>>>> Let me see if I can create a self-contained example that breaks to
>>>>> share with you.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Tue, Feb 4, 2020 at 9:53 AM Pablo Estrada <pabl...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Hm, that's odd. No changes to the pipeline? Are you able to share some
>>>>>> of the code?
>>>>>>
>>>>>> +Udi Meiri <eh...@google.com> do you have any idea what could be
>>>>>> going on here?
>>>>>>
>>>>>> On Tue, Feb 4, 2020 at 9:25 AM Alan Krumholz <alan.krumh...@betterup.co>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Pablo,
>>>>>>> This is strange... it doesn't seem to be the latest Beam release, as
>>>>>>> last night it was already using 2.19.0. I wonder if it was some
>>>>>>> release from the Dataflow team (not Beam related):
>>>>>>>
>>>>>>> Job type: Batch
>>>>>>> Job status: Succeeded
>>>>>>> SDK version: Apache Beam Python 3.5 SDK 2.19.0
>>>>>>> Region: us-central1
>>>>>>> Start time: February 3, 2020 at 9:28:35 PM GMT-8
>>>>>>> Elapsed time: 5 min 11 sec
>>>>>>>
>>>>>>> On Tue, Feb 4, 2020 at 9:15 AM Pablo Estrada <pabl...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Alan,
>>>>>>>> Could it be that you're picking up the new Apache Beam 2.19.0
>>>>>>>> release? Could you try depending on Beam 2.18.0 to see if the issue
>>>>>>>> surfaces when using the new release?
>>>>>>>>
>>>>>>>> If something was working and no longer works, it sounds like a bug.
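[Editor's note: as an aside, the repeated bq_table.split(...) calls in the test job above can be factored into a small helper; this is only an illustrative sketch, and the parse_bq_table name is mine, not from the thread.]

```python
def parse_bq_table(spec):
    """Split a 'project:dataset.table' spec into its three parts."""
    project, rest = spec.split(":", 1)
    dataset, table = rest.split(".", 1)
    return project, dataset, table

# Matches the inline splits used in the test job above.
print(parse_bq_table("my-project:my_dataset.my_table"))
# → ('my-project', 'my_dataset', 'my_table')
```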
>>>>>>>> This may have to do with how we pickle (dill / cloudpickle) - see
>>>>>>>> this question:
>>>>>>>> https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
>>>>>>>> Best
>>>>>>>> -P.
>>>>>>>>
>>>>>>>> On Tue, Feb 4, 2020 at 6:22 AM Alan Krumholz <alan.krumh...@betterup.co>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I was running a Dataflow job in GCP last night and it was running
>>>>>>>>> fine. This morning this same exact job is failing with the
>>>>>>>>> following error:
>>>>>>>>>
>>>>>>>>> Error message from worker: Traceback (most recent call last):
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 286, in loads
>>>>>>>>>     return dill.loads(s)
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
>>>>>>>>>     return load(file, ignore, **kwds)
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
>>>>>>>>>     return Unpickler(file, ignore=ignore, **kwds).load()
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
>>>>>>>>>     obj = StockUnpickler.load(self)
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in _load_type
>>>>>>>>>     return _reverse_typemap[name]
>>>>>>>>> KeyError: 'ClassType'
>>>>>>>>>
>>>>>>>>> During handling of the above exception, another exception occurred:
>>>>>>>>>
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dataflow_worker/batchworker.py", line 648, in do_work
>>>>>>>>>     work_executor.execute()
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dataflow_worker/executor.py", line 176, in execute
>>>>>>>>>     op.start()
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 649, in apache_beam.runners.worker.operations.DoOperation.start
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 651, in apache_beam.runners.worker.operations.DoOperation.start
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 652, in apache_beam.runners.worker.operations.DoOperation.start
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 261, in apache_beam.runners.worker.operations.Operation.start
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 266, in apache_beam.runners.worker.operations.Operation.start
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 597, in apache_beam.runners.worker.operations.DoOperation.setup
>>>>>>>>>   File "apache_beam/runners/worker/operations.py", line 602, in apache_beam.runners.worker.operations.DoOperation.setup
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 290, in loads
>>>>>>>>>     return dill.loads(s)
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
>>>>>>>>>     return load(file, ignore, **kwds)
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
>>>>>>>>>     return Unpickler(file, ignore=ignore, **kwds).load()
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
>>>>>>>>>     obj = StockUnpickler.load(self)
>>>>>>>>>   File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in _load_type
>>>>>>>>>     return _reverse_typemap[name]
>>>>>>>>> KeyError: 'ClassType'
>>>>>>>>>
>>>>>>>>> If I use a local runner it still runs fine.
>>>>>>>>> Anyone else experiencing something similar today? (or know how to
>>>>>>>>> fix this?)
>>>>>>>>>
>>>>>>>>> Thanks!
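[Editor's note: for context on the traceback above. Beam pickles each transform in the submitting environment and unpickles it on the worker, and the KeyError: 'ClassType' comes from dill's _reverse_typemap failing to map a serialized type name back to a type during that unpickling, which is why mismatched serializer versions between the two environments can break the job. A round trip within a single environment stays self-consistent, which matches the local runner working fine. Below is a minimal sketch of that round-trip idea using the stdlib pickle module as a stand-in; Beam actually uses dill, which, unlike stdlib pickle, can serialize classes by value.]

```python
import pickle

class GenData:
    """Stand-in for the DoFn from the test job in this thread."""
    def process(self, _):
        for _ in range(3):
            yield {'a': 1, 'b': 2}

# Serialize and deserialize within one environment: the same library
# version writes and reads the payload, so the names always resolve.
blob = pickle.dumps(GenData())
restored = pickle.loads(blob)
rows = list(restored.process(None))
print(rows)  # three {'a': 1, 'b': 2} dicts
```

The failure in the thread only appears when the writing and reading sides disagree (e.g. different dill versions, or cloudpickle interfering), which a single-process example like this cannot reproduce.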