hi Jan,

The issue is that the hdfsWrite API uses int32_t (aka "tSize") for write sizes:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs-internal.h#L69

So when writing files over INT32_MAX, we must write in chunks.
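Until a fix lands, one user-side workaround is to make sure no single write() call passes more than INT32_MAX bytes to the HDFS file. Here is a rough, untested sketch; the ChunkedWriter class and the 1 GiB chunk size are only illustrative, not part of pyarrow:

    CHUNK = 1 << 30  # 1 GiB, comfortably below INT32_MAX

    class ChunkedWriter:
        """Thin proxy around a writable file that splits each write()
        into pieces small enough for the int32_t-based hdfsWrite call.
        Only .write() is implemented, which is all pickle needs."""

        def __init__(self, raw_file):
            self._raw = raw_file  # e.g. an HDFS file opened with 'wb'

        def write(self, data):
            view = memoryview(data)
            for off in range(0, len(view), CHUNK):
                self._raw.write(view[off:off + CHUNK])
            return len(view)

With that, the model in your example could be saved via model.save(ChunkedWriter(model_fd), sep_limit=1024 * 1024), assuming gensim only ever calls .write() on the handle (your traceback suggests it goes through _pickle.dump, which does).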
Can you please open a JIRA with your bug report and this information so
that this can be fixed in a future release?

Thanks!
Wes

On Thu, Apr 19, 2018 at 7:14 AM, Jan-Hendrik Zab <z...@l3s.de> wrote:
>
> Hello!
>
> I'm currently trying to use pyarrow's hdfs lib from within Hadoop
> streaming, specifically in the reducer with Python 3.6 (Anaconda), but
> the problem described below occurs either way. The pyarrow version is
> 0.9.0.
>
> I'm starting the actual Python script via a wrapper sh script that sets
> LD_LIBRARY_PATH, since I found that setting it from within Python was
> not sufficient.
>
> When I test the reducer by piping in data manually and try to save
> data (in this case a gensim model) that is roughly 3 GB, I only get
> the following error message:
>
>   File "reducer.py", line 104, in <module>
>     save_model(model)
>   File "reducer.py", line 65, in save_model
>     model.save(model_fd, sep_limit=1024 * 1024, pickle_protocol=4)
>   File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", line 930, in save
>     super(Word2Vec, self).save(*args, **kwargs)
>   File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 281, in save
>     super(BaseAny2VecModel, self).save(fname_or_handle, **kwargs)
>   File "/opt/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 688, in save
>     _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
>   File "io.pxi", line 220, in pyarrow.lib.NativeFile.write
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS: Write failed
>
> Files around 700 MB in size seem to work fine, though. Our default
> block size is 128 MB.
>
> The code that saves the model is the following:
>
> model = word2vec.Word2Vec(size=300, workers=8, iter=1, sg=1)
> # building model here [removed]
> hdfs_client = hdfs.connect(active_master)
> with hdfs_client.open("/user/zab/w2v/%s_test.model" % key, 'wb') as model_fd:
>     model.save(model_fd, sep_limit=1024 * 1024)
>
> I would appreciate any help :-)
>
> Best,
> Jan
>
> --
> Leibniz Universität Hannover
> Institut für Verteilte Systeme
> Appelstrasse 4 - 30167 Hannover
> Phone: +49 (0)511 762 - 17706
> Tax ID/Steuernummer: DE811245527