hi Jan,

The issue is that the hdfsWrite API uses int32_t (aka "tSize") for write sizes:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs-internal.h#L69

So when writing files over INT32_MAX, we must write in chunks.
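Until a fix lands, one user-side workaround is to make sure no single write() call passes more than INT32_MAX bytes to the HDFS file. Here is a rough, untested sketch; the ChunkedWriter class and the 1 GiB chunk size are only illustrative, not part of pyarrow:

    CHUNK = 1 << 30  # 1 GiB, comfortably below INT32_MAX

    class ChunkedWriter:
        """Thin proxy around a writable file that splits each write()
        into pieces small enough for the int32_t-based hdfsWrite call.
        Only .write() is implemented, which is all pickle needs."""

        def __init__(self, raw_file):
            self._raw = raw_file  # e.g. an HDFS file opened with 'wb'

        def write(self, data):
            view = memoryview(data)
            for off in range(0, len(view), CHUNK):
                self._raw.write(view[off:off + CHUNK])
            return len(view)

With that, the model in your example could be saved via model.save(ChunkedWriter(model_fd), sep_limit=1024 * 1024), assuming gensim only ever calls .write() on the handle (your traceback suggests it goes through _pickle.dump, which does).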
Can you please open a JIRA with your bug report and this information so
that this can be fixed in a future release?

Thanks!
Wes

On Thu, Apr 19, 2018 at 7:14 AM, Jan-Hendrik Zab <z...@l3s.de> wrote:
>
> Hello!
>
> I'm currently trying to use pyarrow's hdfs lib from within Hadoop
> streaming, specifically in the reducer with Python 3.6 (Anaconda), but
> the problem described below occurs either way. The pyarrow version is
> 0.9.0.
>
> I'm starting the actual Python script via a wrapper sh script that sets
> LD_LIBRARY_PATH, since I found that setting it from within Python was
> not sufficient.
>
> When I test the reducer by piping in data manually and try to save
> data (in this case a gensim model) that is roughly 3 GB, I only get
> the following error message:
>
>   File "reducer.py", line 104, in <module>
>     save_model(model)
>   File "reducer.py", line 65, in save_model
>     model.save(model_fd, sep_limit=1024 * 1024, pickle_protocol=4)
>   File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/word2vec.py", line 930, in save
>     super(Word2Vec, self).save(*args, **kwargs)
>   File "/opt/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 281, in save
>     super(BaseAny2VecModel, self).save(fname_or_handle, **kwargs)
>   File "/opt/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 688, in save
>     _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
>   File "io.pxi", line 220, in pyarrow.lib.NativeFile.write
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS: Write failed
>
> Files around 700 MB in size seem to work fine, though. Our default
> block size is 128 MB.
>
> The code that saves the model is the following:
>
> model = word2vec.Word2Vec(size=300, workers=8, iter=1, sg=1)
> # building model here [removed]
> hdfs_client = hdfs.connect(active_master)
> with hdfs_client.open("/user/zab/w2v/%s_test.model" % key, 'wb') as model_fd:
>     model.save(model_fd, sep_limit=1024 * 1024)
>
> I would appreciate any help :-)
>
> Best,
> Jan
>
> --
> Leibniz Universität Hannover
> Institut für Verteilte Systeme
> Appelstrasse 4 - 30167 Hannover
> Phone: +49 (0)511 762 - 17706
> Tax ID/Steuernummer: DE811245527