[
https://issues.apache.org/jira/browse/ARROW-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097839#comment-16097839
]
Aditi Breed commented on ARROW-1247:
------------------------------------
Hello Wes,
Thank you for your reply. I greatly appreciate your help here.
After I posted this issue, I went back to the server and uninstalled
Python/Anaconda completely.
I installed Anaconda Python 3.6 and installed pyarrow using pip instead of
conda.
The code worked for smaller datasets that day (on the order of thousands of
records).
Your reply that afternoon also helped me believe that the problem was
solved.
However, I wanted to do a holistic test to ensure that I had really solved the
problem by installing the pip version.
So I ran the code on a huge dataset (not huge by Big Data standards, though);
1633665 records were being saved in this latest run (for each aggregation).
Here is a quick outline of the flow of the code:
1) get data
2) remove NaNs (I double-checked this by saving, re-reading, and checking the
dataset).
3) Aggregate to get Means - Save as parquet with pyarrow
4) Aggregate to get StdDev - Save as parquet with pyarrow
In the case above, it saved the parquet file for Step 3, but failed again on
Step 4.
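The "Step 3 succeeded, Step 4 crashed" observation can be made repeatable by
wrapping each step in flushed begin/done markers, so that even a hard native
crash leaves a record of the step that was in flight. This is only a minimal
sketch: run_steps and last_unfinished_step are hypothetical helper names, and
the step functions stand in for the real aggregate-and-save code.

```python
import sys


def run_steps(steps, log=None):
    """Run (name, callable) pairs, writing flushed BEGIN/DONE markers.

    The flush before each call matters: if the process dies inside native
    code, buffered output would otherwise be lost.
    """
    if log is None:
        log = sys.stderr
    for name, func in steps:
        log.write("BEGIN %s\n" % name)
        log.flush()
        func()
        log.write("DONE %s\n" % name)
        log.flush()


def last_unfinished_step(log_text):
    """Return the name of the step that began but never finished, or None."""
    begun = None
    for line in log_text.splitlines():
        if line.startswith("BEGIN "):
            begun = line[len("BEGIN "):]
        elif line.startswith("DONE ") and line[len("DONE "):] == begun:
            begun = None
    return begun
```

Pointing log at a file on disk keeps the markers after the process is gone.
The standard-library faulthandler module (faulthandler.enable()) can be
switched on at startup as well, to dump the Python traceback on a fatal error.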
Python again crashed with the error "Python has stopped working".
Following is the detailed error:
Description:
Stopped working
Problem signature:
Problem Event Name: APPCRASH
Application Name: python.exe
Application Version: 3.6.2150.1013
Application Timestamp: 5970e8c6
Fault Module Name: VCRUNTIME140.dll
Fault Module Version: 14.0.24210.0
Fault Module Timestamp: 575a4cd0
Exception Code: c0000005
Exception Offset: 000000000000c387
OS Version: 6.3.9600.2.0.0.400.8
Locale ID: 1033
Read our privacy statement online:
http://go.microsoft.com/fwlink/?linkid=280262
If the online privacy statement is not available, please read our privacy
statement offline:
C:\Windows\system32\en-US\erofflps.txt
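For reading a report like the one above, the small sketch below splits the
"Key: value" lines into a dictionary and looks up the exception code in a short
table of well-known NTSTATUS values. Exception code c0000005 is an access
violation, i.e. an invalid memory read or write. The helper names are
hypothetical and the table is deliberately incomplete.

```python
# A few well-known NTSTATUS values; illustrative, not exhaustive.
KNOWN_EXCEPTION_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION (invalid memory read or write)",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def parse_problem_signature(report):
    """Collect 'Key: value' lines of a crash report into a dict."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def describe_exception_code(code_hex):
    """Map a hexadecimal exception code string to a readable name."""
    code = int(code_hex, 16)
    return KNOWN_EXCEPTION_CODES.get(code, "unknown code 0x%08X" % code)


report = """\
Problem Event Name: APPCRASH
Application Name: python.exe
Fault Module Name: VCRUNTIME140.dll
Exception Code: c0000005
Exception Offset: 000000000000c387
"""
fields = parse_problem_signature(report)
summary = "%s: %s" % (
    fields["Fault Module Name"],
    describe_exception_code(fields["Exception Code"]),
)
print(summary)
```

The fault module only names where the invalid access happened. The bad pointer
may well have been passed in by another library, so this alone does not show
that VCRUNTIME140.dll itself is at fault.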
Since pyarrow was able to save the first aggregated Arrow table, it does not
seem to have a problem with saving the file itself.
Do I have the correct version of the VC++ redistributable for VS2015?
Currently I have the x64 version of the VC++ redistributable, 14.0.24215..
The machine also has the x86 version of the same redistributable (I don't know
how or why).
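One way to see how many copies of the runtime DLL are visible to the process is
to scan the directories on PATH. The sketch below assumes a hypothetical helper
find_copies. It only lists candidates: which copy Windows actually loads
depends on its DLL search order, which this code does not model.

```python
import os


def find_copies(filename, directories):
    """Return full paths of files named `filename` (case-insensitive)
    found directly inside the given directories, in search order."""
    wanted = filename.lower()
    hits = []
    seen = set()
    for directory in directories:
        if not directory or not os.path.isdir(directory):
            continue
        key = os.path.normcase(os.path.abspath(directory))
        if key in seen:
            continue  # the same directory can appear twice on PATH
        seen.add(key)
        try:
            entries = os.listdir(directory)
        except OSError:
            continue  # unreadable directory, skip it
        for entry in entries:
            if entry.lower() == wanted:
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    path_dirs = os.environ.get("PATH", "").split(os.pathsep)
    for hit in find_copies("vcruntime140.dll", path_dirs):
        print(hit)
```

Several hits, for example one under the Anaconda directory and one under
C:\Windows\System32, would at least show which runtime versions are in play.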
I decided to run conda list and check the results. It shows a pre-release
version of arrow-cpp (I don't know why, because I did not install it from
conda; I did a pip install pyarrow, so I assumed it would bring in the
associated C++ libraries).
>conda list
# packages in environment at XXX:
#
_license 1.1 py36_1
alabaster 0.7.10 py36_0
anaconda custom py36_0
anaconda-client 1.6.3 py36_0
anaconda-navigator 1.6.3 py36_0
anaconda-project 0.6.0 py36_0
arrow-cpp 0.5.0.pre np113py36_vc14_4 [vc14] conda-forge
asn1crypto 0.22.0 py36_0
astroid 1.4.9 py36_0
astropy 2.0 np113py36_0
babel 2.4.0 py36_0
backports 1.0 py36_0
beautifulsoup4 4.6.0 py36_0
bitarray 0.8.1 py36_1
bkcharts 0.2 py36_0
blaze 0.10.1 py36_0
bleach 1.5.0 py36_0
bokeh 0.12.6 py36_0
boto 2.47.0 py36_0
bottleneck 1.2.1 np113py36_0
bzip2 1.0.6 vc14_3 [vc14]
cffi 1.10.0 py36_0
chardet 3.0.4 py36_0
click 6.7 py36_0
cloudpickle 0.2.2 py36_0
clyent 1.2.2 py36_0
colorama 0.3.9 py36_0
comtypes 1.1.2 py36_0
conda 4.3.22 py36_0 conda-forge
conda-env 2.6.0 0 conda-forge
console_shortcut 0.1.1 py36_1
contextlib2 0.5.5 py36_0
cryptography 1.8.1 py36_0
curl 7.52.1 vc14_0 [vc14]
cycler 0.10.0 py36_0
cython 0.25.2 py36_0
cytoolz 0.8.2 py36_0
dask 0.15.0 py36_0
datashape 0.5.4 py36_0
decorator 4.0.11 py36_0
distributed 1.17.1 py36_0
docutils 0.13.1 py36_0
entrypoints 0.2.3 py36_0
et_xmlfile 1.0.1 py36_0
fastcache 1.0.2 py36_1
flask 0.12.2 py36_0
flask-cors 3.0.2 py36_0
freetype 2.5.5 vc14_2 [vc14]
get_terminal_size 1.0.0 py36_0
gevent 1.2.2 py36_0
greenlet 0.4.12 py36_0
h5py 2.7.0 np113py36_0
hdf5 1.8.15.1 vc14_4 [vc14]
heapdict 1.0.0 py36_1
html5lib 0.999 py36_0
icu 57.1 vc14_0 [vc14]
idna 2.5 py36_0
imagesize 0.7.1 py36_0
ipykernel 4.6.1 py36_0
ipython 6.1.0 py36_0
ipython_genutils 0.2.0 py36_0
ipywidgets 6.0.0 py36_0
isort 4.2.15 py36_0
itsdangerous 0.24 py36_0
jdcal 1.3 py36_0
jedi 0.10.2 py36_2
jinja2 2.9.6 py36_0
jpeg 9b vc14_0 [vc14]
jsonschema 2.6.0 py36_0
jupyter 1.0.0 py36_3
jupyter_client 5.1.0 py36_0
jupyter_console 5.1.0 py36_0
jupyter_core 4.3.0 py36_0
lazy-object-proxy 1.3.1 py36_0
libpng 1.6.27 vc14_0 [vc14]
libtiff 4.0.6 vc14_3 [vc14]
llvmlite 0.19.0 py36_0
locket 0.2.0 py36_1
lxml 3.8.0 py36_0
markupsafe 0.23 py36_2
matplotlib 2.0.2 np113py36_0
menuinst 1.4.7 py36_0
mistune 0.7.4 py36_0
mkl 2017.0.3 0
mkl-service 1.1.2 py36_3
mpmath 0.19 py36_1
msgpack-python 0.4.8 py36_0
multipledispatch 0.4.9 py36_0
navigator-updater 0.1.0 py36_0
nbconvert 5.2.1 py36_0
nbformat 4.3.0 py36_0
networkx 1.11 py36_0
nltk 3.2.4 py36_0
nose 1.3.7 py36_1
notebook 5.0.0 py36_0
numba 0.34.0 np113py36_5
numexpr 2.6.2 np113py36_0
numpy 1.13.1 py36_0
numpydoc 0.6.0 py36_0
odo 0.5.0 py36_1
olefile 0.44 py36_0
openpyxl 2.4.7 py36_0
openssl 1.0.2l vc14_0 [vc14]
packaging 16.8 py36_0
pandas 0.20.2 np113py36_0
pandocfilters 1.4.1 py36_0
parquet-cpp 1.2.0.pre vc14_2 [vc14] conda-forge
partd 0.3.8 py36_0
path.py 10.3.1 py36_0
pathlib2 2.2.1 py36_0
patsy 0.4.1 py36_0
pep8 1.7.0 py36_0
pickleshare 0.7.4 py36_0
pillow 4.2.1 py36_0
pip 9.0.1 py36_1
ply 3.10 py36_0
prompt_toolkit 1.0.14 py36_0
psutil 5.2.2 py36_0
py 1.4.34 py36_0
py4j 0.10.4 <pip>
pyarrow 0.4.1 <pip>
pycosat 0.6.2 py36_0
pycparser 2.17 py36_0
pycrypto 2.6.1 py36_6
pycurl 7.43.0 py36_2
pyflakes 1.5.0 py36_0
pygments 2.2.0 py36_0
pylint 1.6.4 py36_1
pyodbc 4.0.17 py36_0
pyopenssl 17.0.0 py36_0
pyparsing 2.1.4 py36_0
pyqt 5.6.0 py36_2
pytables 3.2.2 np113py36_4
pytest 3.1.2 py36_0
python 3.6.2 0
python-dateutil 2.6.0 py36_0
pytz 2017.2 py36_0
PyUber 1.4.4 <pip>
pywavelets 0.5.2 np113py36_0
pywin32 220 py36_2
pyyaml 3.12 py36_0
pyzmq 16.0.2 py36_0
qt 5.6.2 vc14_5 [vc14]
qtawesome 0.4.4 py36_0
qtconsole 4.3.0 py36_0
qtpy 1.2.1 py36_0
requests 2.14.2 py36_0
rope 0.9.4 py36_1
ruamel_yaml 0.11.14 py36_1
scikit-image 0.13.0 np113py36_0
scikit-learn 0.18.2 np113py36_0
scipy 0.19.1 np113py36_0
seaborn 0.7.1 py36_0
setuptools 27.2.0 py36_1
simplegeneric 0.8.1 py36_1
singledispatch 3.4.0.3 py36_0
sip 4.18 py36_0
six 1.10.0 py36_0
snowballstemmer 1.2.1 py36_0
sortedcollections 0.5.3 py36_0
sortedcontainers 1.5.7 py36_0
sphinx 1.6.2 py36_0
sphinxcontrib 1.0 py36_0
sphinxcontrib-websupport 1.0.1 py36_0
spyder 3.1.4 py36_0
sqlalchemy 1.1.11 py36_0
statsmodels 0.8.0 np113py36_0
sympy 1.1 py36_0
tblib 1.3.2 py36_0
testpath 0.3.1 py36_0
tk 8.5.18 vc14_0 [vc14]
toolz 0.8.2 py36_0
tornado 4.5.1 py36_0
traitlets 4.3.2 py36_0
unicodecsv 0.14.1 py36_0
vc 14 0 conda-forge
vs2015_runtime 14.0.25420 0
wcwidth 0.1.7 py36_0
werkzeug 0.12.2 py36_0
wheel 0.29.0 py36_0
widgetsnbextension 2.0.0 py36_0
win_unicode_console 0.5 py36_0
wrapt 1.10.10 py36_0
xlrd 1.0.0 py36_0
xlsxwriter 0.9.6 py36_0
xlwings 0.10.4 py36_0
xlwt 1.2.0 py36_0
zict 0.1.2 py36_0
zlib 1.2.8 vc14_3 [vc14]
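The listing above can be checked mechanically for Arrow-related packages that
come from different installers. The sketch below parses conda list output and
treats a "<pip>" build column as a pip install. The helper names and the set of
package names are assumptions made for illustration. A mixed install is only a
possible cause of the crash, not an established one.

```python
# Package names treated as Arrow-related; an assumption for this sketch.
ARROW_PACKAGES = {"pyarrow", "arrow-cpp", "parquet-cpp"}


def parse_conda_list(text):
    """Map package name -> (version, source) from `conda list` output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith(">"):
            continue  # skip blanks, comment headers and the prompt line
        parts = line.split()
        if len(parts) < 2:
            continue
        name, version, rest = parts[0], parts[1], parts[2:]
        source = "pip" if "<pip>" in rest else "conda"
        packages[name.lower()] = (version, source)
    return packages


def arrow_install_sources(packages):
    """Return only the Arrow-related entries of a parsed listing."""
    return {name: packages[name] for name in sorted(ARROW_PACKAGES) if name in packages}


def is_mixed_install(packages):
    """True if Arrow-related packages come from more than one installer."""
    sources = {source for _, source in arrow_install_sources(packages).values()}
    return len(sources) > 1


sample = """\
>conda list
# packages in environment at XXX:
#
arrow-cpp 0.5.0.pre np113py36_vc14_4 [vc14] conda-forge
numpy 1.13.1 py36_0
parquet-cpp 1.2.0.pre vc14_2 [vc14] conda-forge
pyarrow 0.4.1 <pip>
"""
parsed = parse_conda_list(sample)
print(arrow_install_sources(parsed))
print("mixed install:", is_mixed_install(parsed))
```

For this listing the check reports a mixed install: pyarrow 0.4.1 comes from
pip, while arrow-cpp 0.5.0.pre and parquet-cpp 1.2.0.pre come from conda-forge.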
I also did a pip freeze:
>pip freeze
alabaster==0.7.10
anaconda-client==1.6.3
anaconda-navigator==1.6.3
anaconda-project==0.6.0
asn1crypto==0.22.0
astroid==1.4.9
astropy==2.0
Babel==2.4.0
backports.shutil-get-terminal-size==1.0.0
beautifulsoup4==4.6.0
bitarray==0.8.1
bkcharts==0.2
blaze==0.10.1
bleach==1.5.0
bokeh==0.12.6
boto==2.47.0
Bottleneck==1.2.1
cffi==1.10.0
chardet==3.0.4
click==6.7
cloudpickle==0.2.2
clyent==1.2.2
colorama==0.3.9
comtypes==1.1.2
conda==4.3.22
contextlib2==0.5.5
cryptography==1.8.1
cycler==0.10.0
Cython==0.25.2
cytoolz==0.8.2
dask==0.15.0
datashape==0.5.4
decorator==4.0.11
distributed==1.17.1
docutils==0.13.1
entrypoints==0.2.3
et-xmlfile==1.0.1
fastcache==1.0.2
Flask==0.12.2
Flask-Cors==3.0.2
gevent==1.2.2
greenlet==0.4.12
h5py==2.7.0
HeapDict==1.0.0
html5lib==0.999
idna==2.5
imagesize==0.7.1
ipykernel==4.6.1
ipython==6.1.0
ipython-genutils==0.2.0
ipywidgets==6.0.0
isort==4.2.15
itsdangerous==0.24
jdcal==1.3
jedi==0.10.2
Jinja2==2.9.6
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.1.0
jupyter-console==5.1.0
jupyter-core==4.3.0
lazy-object-proxy==1.3.1
llvmlite==0.19.0
locket==0.2.0
lxml==3.8.0
MarkupSafe==0.23
matplotlib==2.0.2
menuinst==1.4.7
mistune==0.7.4
mpmath==0.19
msgpack-python==0.4.8
multipledispatch==0.4.9
navigator-updater==0.1.0
nbconvert==5.2.1
nbformat==4.3.0
networkx==1.11
nltk==3.2.4
nose==1.3.7
notebook==5.0.0
numba==0.34.0+5.g1762237
numexpr==2.6.2
numpy==1.13.1
numpydoc==0.6.0
odo==0.5.0
olefile==0.44
openpyxl==2.4.7
packaging==16.8
pandas==0.20.2
pandocfilters==1.4.1
partd==0.3.8
path.py==10.3.1
pathlib2==2.2.1
patsy==0.4.1
pep8==1.7.0
pickleshare==0.7.4
Pillow==4.2.1
ply==3.10
prompt-toolkit==1.0.14
psutil==5.2.2
py==1.4.34
py4j==0.10.4
pyarrow==0.4.1
pycosat==0.6.2
pycparser==2.17
pycrypto==2.6.1
pycurl==7.43.0
pyflakes==1.5.0
Pygments==2.2.0
pylint==1.6.4
pyodbc==4.0.17
pyOpenSSL==17.0.0
pyparsing==2.1.4
pyspark==2.1.1+hadoop2.7
pytest==3.1.2
python-dateutil==2.6.0
pytz==2017.2
PyUber==1.4.4
PyWavelets==0.5.2
pywin32==220
PyYAML==3.12
pyzmq==16.0.2
QtAwesome==0.4.4
qtconsole==4.3.0
QtPy==1.2.1
requests==2.14.2
rope-py3k==0.9.4.post1
scikit-image==0.13.0
scikit-learn==0.18.2
scipy==0.19.1
seaborn==0.7.1
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.10.0
snowballstemmer==1.2.1
sortedcollections==0.5.3
sortedcontainers==1.5.7
sphinx==1.6.2
sphinxcontrib-websupport==1.0.1
spyder==3.1.4
SQLAlchemy==1.1.11
statsmodels==0.8.0
sympy==1.1
tables==3.2.2
tblib==1.3.2
testpath==0.3
toolz==0.8.2
tornado==4.5.1
traitlets==4.3.2
unicodecsv==0.14.1
wcwidth==0.1.7
Werkzeug==0.12.2
widgetsnbextension==2.0.0
win-unicode-console==0.5
wrapt==1.10.10
xlrd==1.0.0
XlsxWriter==0.9.6
xlwings==0.10.4
xlwt==1.2.0
zict==0.1.2
Let me know if I am doing something wrong.
Thanks,
Adu
> pyarrow causes python to crash errors on parquet.dll
> ----------------------------------------------------
>
> Key: ARROW-1247
> URL: https://issues.apache.org/jira/browse/ARROW-1247
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.4.1
> Environment: Python Version:
> 3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900
> 64 bit (AMD64)]
> Windows Edition: Windows Server 2012 R2
> Reporter: Aditi Breed
>
> Hello,
> I have a script which fetches data, and stores the data in Pandas
> dataframe.
> I make 3 aggregations of data, MEAN/STDEV/MAX, each of which are converted to
> an arrow table and saved on the disk as a parquet file.
> This code works just fine for 100-500 records, but errors out for bigger
> volume. I also know this code works because another developer is using the
> same code on a mirrored machine ( in terms of hardware ) and it works.
> The order of the dataset I am trying to save is millions.
> The code errors out @ line pq.write_table(arrowTable, filePath).
> Here is the code:
> arrowTable = pa.Table.from_pandas(self.grpByMeanDS2)
>
> begintime = datetime.now()
> begintime_str = begintime.strftime("%Y%m%d%I%M%S")
>
> filePath = SaveFileLoc + "\\Raw\\" + agg + "Data" + begintime_str +
> ".parq"
> print('Begin Saving File')
> pq.write_table(arrowTable, filePath)
> print('Done Saving File')
>
> print('Appending FilePath to List')
> self.listspDF.append(filePath)
> print('Done Appending FilePath to List')
>
> Python crashes and throws a "python has to close error".
> Following is the detailed error:
> ------------------
> Problem Event Name: APPCRASH
> Application Name: python.exe
> Application Version: 3.5.2150.1013
> Application Timestamp: 577be340
> Fault Module Name: parquet.dll
> Fault Module Version: 0.0.0.0
> Fault Module Timestamp: 59403662
> Exception Code: c0000005
> Exception Offset: 000000000005f990
> OS Version: 6.3.9600.2.0.0.400.8
> Locale ID: 1033
> Read our privacy statement online:
> http://go.microsoft.com/fwlink/?linkid=280262
> If the online privacy statement is not available, please read our privacy
> statement offline:
> C:\Windows\system32\en-US\erofflps.txt
> --------------------------------------------
> I have tried updating Python and pyarrow, with no luck.
> Following is the version of python:
> import sys
> print (sys.version)
> 3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC
> v.1900 64 bit (AMD64)]
> Following are results of pip freeze:
> alabaster==0.7.9
> anaconda-clean==1.0
> anaconda-client==1.5.1
> anaconda-navigator==1.3.1
> argcomplete==1.0.0
> astroid==1.4.7
> astropy==2.0
> Babel==2.3.4
> backports.shutil-get-terminal-size==1.0.0
> beautifulsoup4==4.5.1
> bitarray==0.8.1
> blaze==0.10.1
> bokeh==0.12.2
> boto==2.42.0
> Bottleneck==1.2.1
> cffi==1.7.0
> chest==0.2.3
> click==6.6
> cloudpickle==0.2.1
> clyent==1.2.2
> colorama==0.3.7
> comtypes==1.1.2
> conda==4.3.22
> conda-build==2.0.2
> configobj==5.0.6
> contextlib2==0.5.3
> cryptography==1.5
> cycler==0.10.0
> Cython==0.24.1
> cytoolz==0.8.0
> dask==0.11.0
> datashape==0.5.2
> decorator==4.0.10
> dill==0.2.5
> docutils==0.12
> dynd===c328ab7
> et-xmlfile==1.0.1
> fastcache==1.0.2
> filelock==2.0.6
> Flask==0.11.1
> Flask-Cors==2.1.2
> gevent==1.1.2
> greenlet==0.4.10
> h5py==2.7.0
> HeapDict==1.0.0
> idna==2.1
> imageio==2.2.0
> imagesize==0.7.1
> ipykernel==4.5.0
> ipython==5.1.0
> ipython-genutils==0.1.0
> ipywidgets==5.2.2
> itsdangerous==0.24
> jdcal==1.2
> jedi==0.9.0
> Jinja2==2.8
> jsonschema==2.5.1
> jupyter==1.0.0
> jupyter-client==4.4.0
> jupyter-console==5.0.0
> jupyter-core==4.2.0
> lazy-object-proxy==1.2.1
> llvmlite==0.19.0
> locket==0.2.0
> lxml==3.6.4
> MarkupSafe==0.23
> matplotlib==2.0.2
> menuinst==1.4.1
> mistune==0.7.3
> mpmath==0.19
> multipledispatch==0.4.8
> nb-anacondacloud==1.2.0
> nb-conda==2.0.0
> nb-conda-kernels==2.0.0
> nbconvert==4.2.0
> nbformat==4.1.0
> nbpresent==3.0.2
> networkx==1.11
> nltk==3.2.1
> nose==1.3.7
> notebook==4.2.3
> numba==0.34.0
> numexpr==2.6.2
> numpy==1.13.1
> odo==0.5.0
> openpyxl==2.3.2
> pandas==0.20.2
> partd==0.3.6
> path.py==0.0.0
> pathlib2==2.1.0
> patsy==0.4.1
> pep8==1.7.0
> pickleshare==0.7.4
> Pillow==3.3.1
> pkginfo==1.3.2
> ply==3.9
> prompt-toolkit==1.0.3
> psutil==4.3.1
> py==1.4.31
> py4j==0.10.4
> pyarrow==0.4.1
> pyasn1==0.1.9
> pycosat==0.6.1
> pycparser==2.14
> pycrypto==2.6.1
> pycurl==7.43.0
> pyflakes==1.3.0
> Pygments==2.1.3
> pyidealdata==0.7.0
> pylint==1.5.4
> pyodbc==4.0.17
> pyOpenSSL==16.2.0
> pyparsing==2.1.4
> pyspark==2.1.0+hadoop2.7
> pytest==2.9.2
> python-dateutil==2.5.3
> pytz==2016.6.1
> PyUber==1.4.4
> PyWavelets==0.5.2
> pywin32==220
> PyYAML==3.12
> pyzmq==15.4.0
> QtAwesome==0.3.3
> qtconsole==4.2.1
> QtPy==1.1.2
> requests==2.14.2
> rope-py3k==0.9.4.post1
> ruamel-yaml===-VERSION
> scikit-image==0.13.0
> scikit-learn==0.18.2
> scipy==0.19.1
> simplegeneric==0.8.1
> singledispatch==3.4.0.3
> six==1.10.0
> snowballstemmer==1.2.1
> sockjs-tornado==1.0.3
> sphinx==1.4.6
> spyder==3.0.0
> SQLAlchemy==1.0.13
> statsmodels==0.8.0
> sympy==1.0
> tables==3.2.2
> toolz==0.8.0
> tornado==4.4.1
> traitlets==4.3.0
> unicodecsv==0.14.1
> wcwidth==0.1.7
> Werkzeug==0.11.11
> widgetsnbextension==1.2.6
> win-unicode-console==0.5
> wrapt==1.10.6
> xlrd==1.0.0
> XlsxWriter==0.9.3
> xlwings==0.10.0
> xlwt==1.1.2
> I was wondering if someone could shed light why pyarrow would not work on a
> certain machine ?
> Thanks,
> Adu
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)