Thank you for your help in getting to the bottom of this. It seems there is
no problem with the C++ code; the problem is with the PyArrow/Python 2.7
combination.
Here are more details. I have two C++ programs writing two Arrow files. The
first one is the bigger plugin I'm attempting to port and the second one is
the small example listed earlier in this thread. The resulting Arrow files
cannot be read by PyArrow in Python 2.7, but they work fine in Python 3.8.
The Arrow and PyArrow versions match. I'm using 0.16.0 since PyPI still has
a PyArrow .whl for Python 2.7.

Here is the output from Python 2.7:

> python
Python 2.7.12 (default, Apr 15 2020, 17:07:12)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'0.16.0'
>>> pyarrow.ipc.open_stream('1').read_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
    return RecordBatchStreamReader(source)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
    self._open(source)
  File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Corrupted message, only 1 bytes available
>>> pyarrow.ipc.open_stream('foo').read_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
    return RecordBatchStreamReader(source)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
    self._open(source)
  File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Corrupted message, only 3 bytes available

And here is the output from Python 3.8:

> python
Python 3.8.3 (default, May 15 2020, 00:00:00)
[GCC 10.1.1 20200507 (Red Hat 10.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'0.16.0'
>>> pyarrow.ipc.open_stream('1').read_pandas()
     x     y
0  -10 -10.0
1   -9   NaN
2   -8  -8.0
3   -7   NaN
4   -6  -6.0
5   -5   NaN
6   -4  -4.0
7   -3   NaN
8   -2  -2.0
9   -1   NaN
10   0   0.0
11   1   NaN
12   2   2.0
13   3   NaN
14   4   4.0
15   5   NaN
16   6   6.0
17   7   NaN
18   8   8.0
19   9   NaN
20  10  10.0
>>> pyarrow.ipc.open_stream('foo').read_pandas()
    foo   bar
0     0   0.0
1     1   NaN
2     2   2.0
3     3   NaN
4     4   4.0
5     5   NaN
6     6   6.0
7     7   NaN
8     8   8.0
9     9   NaN
10   10  10.0
11   11   NaN
12   12  12.0
13   13   NaN
14   14  14.0
15   15   NaN
16   16  16.0
17   17   NaN
18   18  18.0
19   19   NaN
20   20  20.0

Is this a bug in PyArrow or some Python 2.7 packaging issue?

Thanks!
Rares

On Mon, Jun 15, 2020 at 10:55 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Rares,
> This last issue sounds like you are trying to write data from the 0.16.0
> version of the library and read it from a pre-0.15.0 version of the Python
> library. If you want to do this, you need to set "bool
> write_legacy_ipc_format" to true on the IpcWriterOptions/IpcOptions object
> and construct the StreamWriter with the object.
>
> -Micah
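For reference, a minimal sketch of the option Micah describes, assuming the
0.17.x C++ API, where the options struct is named arrow::ipc::IpcWriteOptions
(0.16.0 calls it IpcOptions), so names may need adjusting for other versions:

    #include <arrow/io/file.h>
    #include <arrow/ipc/options.h>
    #include <arrow/ipc/writer.h>
    #include <arrow/record_batch.h>
    #include <arrow/result.h>

    // Sketch: enable the pre-0.15.0 ("legacy") IPC wire format before
    // constructing the stream writer, as Micah suggests.
    arrow::Status writeLegacy(const std::shared_ptr<arrow::Schema>& schema,
                              const std::shared_ptr<arrow::RecordBatch>& batch) {
      ARROW_ASSIGN_OR_RAISE(auto stream,
                            arrow::io::FileOutputStream::Open("/tmp/foo-legacy"));

      auto options = arrow::ipc::IpcWriteOptions::Defaults();
      options.write_legacy_ipc_format = true;  // the flag Micah mentions

      ARROW_ASSIGN_OR_RAISE(
          auto writer, arrow::ipc::NewStreamWriter(stream.get(), schema, options));
      ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
      ARROW_RETURN_NOT_OK(writer->Close());
      return stream->Close();
    }

A pre-0.15.0 reader should then be able to parse the stream; with matching
writer and reader versions, as in the 0.16.0 output above, the flag should
not be needed.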
> On Mon, Jun 15, 2020 at 10:38 PM Rares Vernica <rvern...@gmail.com> wrote:
>
> > With open_stream I get a different error:
> >
> > > python -c "import pyarrow; pyarrow.ipc.open_stream('/tmp/foo')"
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
> >     return RecordBatchStreamReader(source)
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
> >     self._open(source)
> >   File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
> >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowInvalid: Expected to read 1886221359 metadata bytes, but only read 4
> >
> > On Mon, Jun 15, 2020 at 10:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <rvern...@gmail.com> wrote:
> > > >
> > > > I was able to reproduce my issue in a small, fully-contained program.
> > > > Here is the source code:
> > > >
> > > > #include <arrow/builder.h>
> > > > #include <arrow/io/file.h>
> > > > #include <arrow/ipc/writer.h>
> > > > #include <arrow/record_batch.h>
> > > >
> > > > arrow::Status foo() {
> > > >   std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > >   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > >   std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > >   std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > >
> > > >   std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
> > > >   arrowFields[0] = arrow::field("foo", arrow::int64());
> > > >   arrowFields[1] = arrow::field("bar", arrow::int64());
> > > >   std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);
> > > >
> > > >   std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
> > > >   arrow::Int64Builder arrowBuilder;
> > > >   for (int i = 0; i < 2; i++) {
> > > >     for (int j = 0; j < 21; j++)
> > > >       if (i && (j % 2))
> > > >         arrowBuilder.AppendNull();
> > > >       else
> > > >         arrowBuilder.Append(j);
> > > >     ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
> > > >   }
> > > >   arrowBatch = arrow::RecordBatch::Make(arrowSchema,
> > > >                                         arrowArrays[0]->length(), arrowArrays);
> > > >
> > > >   ARROW_ASSIGN_OR_RAISE(arrowStream,
> > > >                         arrow::io::FileOutputStream::Open("/tmp/foo"));
> > > >   ARROW_ASSIGN_OR_RAISE(arrowWriter,
> > > >                         arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
> > > >   ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > > >   ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > >   ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > >
> > > >   return arrow::Status::OK();
> > > > }
> > > >
> > > > int main() {
> > > >   foo();
> > > > }
> > > >
> > > > I compile and run it like this:
> > > >
> > > > > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> > > > -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
> > > >
> > > > The file is small and I can't read it from PyArrow:
> > > >
> > > > > python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
> > >
> > > Here is your problem. Try `pyarrow.ipc.open_stream`.
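The mismatch Wes points out is between the two IPC formats: the reader has to
match the writer. NewStreamWriter output is read with pyarrow.ipc.open_stream;
for pyarrow.ipc.open_file to work, the file format (which adds a footer) has
to be written instead. A hedged sketch, assuming 0.17.x's NewFileWriter is the
file-format counterpart of the NewStreamWriter used above:

    #include <arrow/io/file.h>
    #include <arrow/ipc/writer.h>
    #include <arrow/record_batch.h>
    #include <arrow/result.h>

    // Sketch: write the *file* IPC format, readable with pyarrow.ipc.open_file.
    arrow::Status writeFileFormat(const std::shared_ptr<arrow::Schema>& schema,
                                  const std::shared_ptr<arrow::RecordBatch>& batch) {
      ARROW_ASSIGN_OR_RAISE(auto stream,
                            arrow::io::FileOutputStream::Open("/tmp/foo-file"));
      // NewFileWriter (assumed counterpart of NewStreamWriter) writes the
      // footer that open_file expects when Close() is called.
      ARROW_ASSIGN_OR_RAISE(auto writer,
                            arrow::ipc::NewFileWriter(stream.get(), schema));
      ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
      ARROW_RETURN_NOT_OK(writer->Close());
      return stream->Close();
    }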
> > >
> > > > Traceback (most recent call last):
> > > >   File "<string>", line 1, in <module>
> > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
> > > >     return RecordBatchFileReader(source, footer_offset=footer_offset)
> > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
> > > >     self._open(source, footer_offset=footer_offset)
> > > >   File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
> > > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > > pyarrow.lib.ArrowInvalid: File is too small: 8
> > > >
> > > > Here are the Arrow and g++ versions:
> > > >
> > > > > dpkg -s libarrow-dev
> > > > Package: libarrow-dev
> > > > Status: install ok installed
> > > > Priority: optional
> > > > Section: libdevel
> > > > Installed-Size: 38738
> > > > Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
> > > > Architecture: amd64
> > > > Multi-Arch: same
> > > > Source: apache-arrow
> > > > Version: 0.17.1-1
> > > > Depends: libarrow17 (= 0.17.1-1)
> > > >
> > > > > g++ --version
> > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > >
> > > > Does this make sense?
> > > >
> > > > Cheers,
> > > > Rares
> > > >
> > > > On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com> wrote:
> > > >
> > > > > This is the compiler:
> > > > >
> > > > > > g++ --version
> > > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > > >
> > > > > And this is how I compile the code:
> > > > >
> > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > > > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > > > -c PhysicalAioSave.cpp -o PhysicalAioSave.o
> > > > >
> > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > > > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > > > -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> > > > > LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > > > > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > > > > -Wl,-soname,libaccelerated_io_tools.so -L.
> > > > > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > > > > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> > > > >
> > > > > We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI
> > > > > still has PyArrow binaries for 2.7.
> > > > >
> > > > > Anyway, I temporarily upgraded to 0.17.1 but the result is the same. I
> > > > > also fixed all the deprecation warnings, but that did not help either.
> > > > >
> > > > > Setting a breakpoint might be a challenge since this code runs as a
> > > > > plug-in, but I'll try to isolate this further.
> > > > >
> > > > > Thanks!
> > > > > Rares
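On the breakpoint challenge: one option is to attach to the already-running
host process and set a pending breakpoint that resolves once the plug-in's
shared library loads. A hypothetical gdb session; the pgrep pattern for the
SciDB host process is a guess, and the line number is the CheckStarted
function Wes links to below:

    $ gdb -p $(pgrep -f scidb)         # attach to the running host process
    (gdb) set breakpoint pending on    # the plug-in .so may not be loaded yet
    (gdb) break writer.cc:972          # CheckStarted in Arrow 0.16.0
    (gdb) continue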
> > > > >
> > > > > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > >> What compiler are you using?
> > > > >>
> > > > >> In 0.16.0 (what you said you were targeting, though it would be better
> > > > >> for you to upgrade to 0.17.1) the schema is written in the CheckStarted
> > > > >> function here:
> > > > >>
> > > > >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> > > > >>
> > > > >> Status CheckStarted() {
> > > > >>   if (!started_) {
> > > > >>     return Start();
> > > > >>   }
> > > > >>   return Status::OK();
> > > > >> }
> > > > >>
> > > > >> started_ is set to false by a default member initializer in the
> > > > >> protected block. Maybe you should set a breakpoint in this function
> > > > >> and see if for some reason started_ is true on the first invocation
> > > > >> (in which case it makes me wonder if there is something
> > > > >> not-fully-C++11-compliant about your toolchain).
> > > > >>
> > > > >> Otherwise I'm a bit stumped, since there are lots of production
> > > > >> applications that use this code.
> > > > >>
> > > > >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com> wrote:
> > > > >> >
> > > > >> > Sure, here is briefly what I'm doing:
> > > > >> >
> > > > >> >   bool append = false;
> > > > >> >   std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > > >> >   auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
> > > > >> >   arrowStream = arrowResult.ValueOrDie();
> > > > >> >
> > > > >> >   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > > >> >   std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > > >> >   std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > > >> >
> > > > >> >   std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
> > > > >> >       inputSchema, settings.isAttsOnly());
> > > > >> >   ARROW_RETURN_NOT_OK(
> > > > >> >       arrow::ipc::RecordBatchStreamWriter::Open(
> > > > >> >           arrowStream.get(), arrowSchema, &arrowWriter));
> > > > >> >
> > > > >> >   // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> > > > >> >   ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > > > >> >   ARROW_RETURN_NOT_OK(
> > > > >> >       arrowWriter->WriteRecordBatch(*arrowBatch));
> > > > >> >   ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > > >> >   ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > > >> >
> > > > >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >> >
> > > > >> > > Can you show the code you are writing? The first thing the stream
> > > > >> > > writer does before writing any record batch is write the schema. It
> > > > >> > > sounds like you are using arrow::ipc::WriteRecordBatch somewhere.
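Note that the snippet above elides the reader setup behind a comment. A
hypothetical reconstruction of that step, assuming the 0.17.x
Result-returning RecordBatchStreamReader::Open overload (the function name
setupReader is invented for illustration):

    #include <arrow/buffer.h>
    #include <arrow/io/memory.h>
    #include <arrow/ipc/reader.h>
    #include <arrow/record_batch.h>
    #include <arrow/result.h>

    // Hypothetical sketch: wrap an in-memory buffer holding stream-format
    // IPC data and hand it to a RecordBatchStreamReader.
    arrow::Status setupReader(const std::shared_ptr<arrow::Buffer>& buffer,
                              std::shared_ptr<arrow::RecordBatchReader>* arrowReader) {
      auto bufferReader = std::make_shared<arrow::io::BufferReader>(buffer);
      ARROW_ASSIGN_OR_RAISE(*arrowReader,
                            arrow::ipc::RecordBatchStreamReader::Open(bufferReader));
      return arrow::Status::OK();
    }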
> > > > >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com> wrote:
> > > > >> > >
> > > > >> > > > Hello,
> > > > >> > > >
> > > > >> > > > I have a RecordBatch that I would like to write to a file. I'm using
> > > > >> > > > FileOutputStream::Open to open the file and RecordBatchStreamWriter::Open
> > > > >> > > > to open the stream. I write a record batch with WriteRecordBatch. Finally,
> > > > >> > > > I close the RecordBatchWriter and OutputStream.
> > > > >> > > >
> > > > >> > > > The resulting file size is exactly the size of the Buffer used to store
> > > > >> > > > the RecordBatch. It looks like it is missing the schema. When I try to
> > > > >> > > > open the resulting file from PyArrow I get:
> > > > >> > > >
> > > > >> > > > >>> pa.ipc.open_file('/tmp/1')
> > > > >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > > > >> > > >
> > > > >> > > > $ ll /tmp/1
> > > > >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > > > >> > > >
> > > > >> > > > How can I write the schema as well?
> > > > >> > > >
> > > > >> > > > I was browsing the documentation at
> > > > >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any C++
> > > > >> > > > documentation about RecordBatchStreamWriter or RecordBatchWriter. Is this
> > > > >> > > > intentional?
> > > > >> > > >
> > > > >> > > > Thank you!
> > > > >> > > > Rares