Thank you for your help in getting to the bottom of this. It seems that
there is no problem with the C++ code; the issue lies with the
PyArrow/Python 2.7 combination.

Here are more details. I have two C++ programs writing two Arrow files. The
first is the bigger plugin I'm attempting to port; the second is the small
example listed earlier in this thread. The resulting Arrow files cannot be
read by PyArrow in Python 2.7, but they work fine in Python 3.8. The Arrow
and PyArrow versions match. I'm using 0.16.0 since there is a PyArrow .whl
for Python 2.7 on PyPI.

Here is the output from Python 2.7:

> python
Python 2.7.12 (default, Apr 15 2020, 17:07:12)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'0.16.0'
>>> pyarrow.ipc.open_stream('1').read_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
    return RecordBatchStreamReader(source)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
    self._open(source)
  File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Corrupted message, only 1 bytes available
>>> pyarrow.ipc.open_stream('foo').read_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
    return RecordBatchStreamReader(source)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
    self._open(source)
  File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Corrupted message, only 3 bytes available

And here is the output from Python 3.8:

> python
Python 3.8.3 (default, May 15 2020, 00:00:00)
[GCC 10.1.1 20200507 (Red Hat 10.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'0.16.0'
>>> pyarrow.ipc.open_stream('1').read_pandas()
     x     y
0  -10 -10.0
1   -9   NaN
2   -8  -8.0
3   -7   NaN
4   -6  -6.0
5   -5   NaN
6   -4  -4.0
7   -3   NaN
8   -2  -2.0
9   -1   NaN
10   0   0.0
11   1   NaN
12   2   2.0
13   3   NaN
14   4   4.0
15   5   NaN
16   6   6.0
17   7   NaN
18   8   8.0
19   9   NaN
20  10  10.0
>>> pyarrow.ipc.open_stream('foo').read_pandas()
    foo   bar
0     0   0.0
1     1   NaN
2     2   2.0
3     3   NaN
4     4   4.0
5     5   NaN
6     6   6.0
7     7   NaN
8     8   8.0
9     9   NaN
10   10  10.0
11   11   NaN
12   12  12.0
13   13   NaN
14   14  14.0
15   15   NaN
16   16  16.0
17   17   NaN
18   18  18.0
19   19   NaN
20   20  20.0

Is this a bug in PyArrow or some Python 2.7 package issue?

Thanks!
Rares

On Mon, Jun 15, 2020 at 10:55 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Rares,
> This last issue sounds like you are trying to write data with the 0.16.0
> version of the library and read it with a pre-0.15.0 version of the Python
> library. If you want to do this, you need to set "bool
> write_legacy_ipc_format" to true on the IpcWriteOptions/IpcOptions object
> and construct the StreamWriter with that object.
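>
> A minimal sketch of that (assuming the 0.17.x C++ API, where the options
> struct is arrow::ipc::IpcWriteOptions; in 0.16.0 it is
> arrow::ipc::IpcOptions):
>
>   #include <arrow/ipc/options.h>
>   #include <arrow/ipc/writer.h>
>
>   arrow::ipc::IpcWriteOptions options = arrow::ipc::IpcWriteOptions::Defaults();
>   // Write the pre-0.15.0 IPC framing so older readers can parse the stream.
>   options.write_legacy_ipc_format = true;
>   // Inside a function returning arrow::Status, with arrowStream and
>   // arrowSchema set up as elsewhere in this thread:
>   ARROW_ASSIGN_OR_RAISE(arrowWriter,
>       arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema, options));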
>
> -Micah
>
>
> On Mon, Jun 15, 2020 at 10:38 PM Rares Vernica <rvern...@gmail.com> wrote:
>
> > With open_stream I get a different error:
> >
> > > python -c "import pyarrow; pyarrow.ipc.open_stream('/tmp/foo')"
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
> >     return RecordBatchStreamReader(source)
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
> >     self._open(source)
> >   File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
> >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowInvalid: Expected to read 1886221359 metadata bytes, but only read 4
> >
> >
> > On Mon, Jun 15, 2020 at 10:08 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> >
> > > On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <rvern...@gmail.com>
> > > wrote:
> > > >
> > > > I was able to reproduce my issue in a small, fully-contained
> > > > program. Here is the source code:
> > > >
> > > > #include <arrow/builder.h>
> > > > #include <arrow/io/file.h>
> > > > #include <arrow/ipc/writer.h>
> > > > #include <arrow/record_batch.h>
> > > >
> > > > arrow::Status foo() {
> > > >   std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > >   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > >   std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > >   std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > >
> > > >   std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
> > > >   arrowFields[0] = arrow::field("foo", arrow::int64());
> > > >   arrowFields[1] = arrow::field("bar", arrow::int64());
> > > >   std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);
> > > >
> > > >   std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
> > > >   arrow::Int64Builder arrowBuilder;
> > > >   for (int i = 0; i < 2; i++) {
> > > >     for (int j = 0; j < 21; j++)
> > > >       if (i && (j % 2))
> > > >         ARROW_RETURN_NOT_OK(arrowBuilder.AppendNull());
> > > >       else
> > > >         ARROW_RETURN_NOT_OK(arrowBuilder.Append(j));
> > > >     ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
> > > >   }
> > > >   arrowBatch = arrow::RecordBatch::Make(
> > > >       arrowSchema, arrowArrays[0]->length(), arrowArrays);
> > > >
> > > >   ARROW_ASSIGN_OR_RAISE(
> > > >       arrowStream, arrow::io::FileOutputStream::Open("/tmp/foo"));
> > > >   ARROW_ASSIGN_OR_RAISE(
> > > >       arrowWriter, arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
> > > >   ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > > >   ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > >   ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > >
> > > >   return arrow::Status::OK();
> > > > }
> > > >
> > > > int main() {
> > > >   // Surface any failure from foo() instead of silently discarding it.
> > > >   return foo().ok() ? 0 : 1;
> > > > }
> > > >
> > > > I compile and run it like this:
> > > >
> > > > > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> > > > -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
> > > >
> > > > The file is small and I can't read it from PyArrow:
> > > >
> > > > > python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
> > >
> > > Here is your problem. Try `pyarrow.ipc.open_stream`.
> > >
> > > > Traceback (most recent call last):
> > > >   File "<string>", line 1, in <module>
> > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
> > > >     return RecordBatchFileReader(source, footer_offset=footer_offset)
> > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
> > > >     self._open(source, footer_offset=footer_offset)
> > > >   File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
> > > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > > pyarrow.lib.ArrowInvalid: File is too small: 8
> > > >
> > > > Here are the Arrow and g++ versions:
> > > >
> > > > > dpkg -s libarrow-dev
> > > > Package: libarrow-dev
> > > > Status: install ok installed
> > > > Priority: optional
> > > > Section: libdevel
> > > > Installed-Size: 38738
> > > > Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
> > > > Architecture: amd64
> > > > Multi-Arch: same
> > > > Source: apache-arrow
> > > > Version: 0.17.1-1
> > > > Depends: libarrow17 (= 0.17.1-1)
> > > >
> > > > > g++ --version
> > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > >
> > > > Does this make sense?
> > > >
> > > > Cheers,
> > > > Rares
> > > >
> > > >
> > > > On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com>
> > > > wrote:
> > > >
> > > > > This is the compiler:
> > > > >
> > > > > > g++ --version
> > > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > > >
> > > > > And this is how I compile the code:
> > > > >
> > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC
> > > > > -D_STDC_FORMAT_MACROS -Wno-system-headers -O3 -g -DNDEBUG
> > > > > -D_STDC_LIMIT_MACROS -fno-omit-frame-pointer -std=c++14 -DCPP11
> > > > > -DARROW_NO_DEPRECATED_API -DUSE_ARROW -I.
> > > > > -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > > > -c PhysicalAioSave.cpp -o PhysicalAioSave.o
> > > > >
> > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC
> > > > > -D_STDC_FORMAT_MACROS -Wno-system-headers -O3 -g -DNDEBUG
> > > > > -D_STDC_LIMIT_MACROS -fno-omit-frame-pointer -std=c++14 -DCPP11
> > > > > -DARROW_NO_DEPRECATED_API -DUSE_ARROW -I.
> > > > > -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > > > -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> > > > > LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > > > > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > > > > -Wl,-soname,libaccelerated_io_tools.so -L.
> > > > > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > > > > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> > > > >
> > > > > We targeted 0.16.0 because we are still stuck on Python 2.7 and
> > > > > PyPI still has PyArrow binaries for 2.7.
> > > > >
> > > > > Anyway, I temporarily upgraded to 0.17.1 but the result is the
> > > > > same. I also fixed all the deprecation warnings, but that did not
> > > > > help either.
> > > > >
> > > > > Setting a breakpoint might be a challenge since this code runs as a
> > > > > plug-in, but I'll try to isolate this further.
> > > > >
> > > > > Thanks!
> > > > > Rares
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> What compiler are you using?
> > > > >>
> > > > >> In 0.16.0 (what you said you were targeting, though it would be
> > > > >> better for you to upgrade to 0.17.1) the schema is written in the
> > > > >> CheckStarted function here:
> > > > >>
> > > > >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> > > > >>
> > > > >> Status CheckStarted() {
> > > > >>   if (!started_) {
> > > > >>     return Start();
> > > > >>   }
> > > > >>   return Status::OK();
> > > > >> }
> > > > >>
> > > > >> started_ is set to false by a default member initializer in the
> > > > >> protected block. Maybe you should set a breakpoint in this
> > > > >> function and see if for some reason started_ is true on the first
> > > > >> invocation (in which case it makes me wonder if there is something
> > > > >> not-fully-C++11-compliant about your toolchain).
> > > > >>
> > > > >> Otherwise I'm a bit stumped since there are lots of production
> > > > >> applications that use this code.
> > > > >>
> > > > >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > Sure, here is briefly what I'm doing:
> > > > >> >
> > > > >> >     bool append = false;
> > > > >> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > > >> >     auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
> > > > >> >     arrowStream = arrowResult.ValueOrDie();
> > > > >> >
> > > > >> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > > >> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > > >> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > > >> >
> > > > >> >     std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
> > > > >> >             inputSchema, settings.isAttsOnly());
> > > > >> >     ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
> > > > >> >             arrowStream.get(), arrowSchema, &arrowWriter));
> > > > >> >
> > > > >> >     // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> > > > >> >     ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > > > >> >     ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > > > >> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > > >> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
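> > > > >> >
> > > > >> > A side note: the reproduction program earlier in the thread uses
> > > > >> > the newer Result-based API for this step; a sketch of the
> > > > >> > equivalent call, assuming the same variables as above:
> > > > >> >
> > > > >> >     ARROW_ASSIGN_OR_RAISE(arrowWriter,
> > > > >> >         arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));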
> > > > >> >
> > > > >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Can you show the code you are writing? The first thing the
> > > > >> > > stream writer does before writing any record batch is write the
> > > > >> > > schema. It sounds like you are using arrow::ipc::WriteRecordBatch
> > > > >> > > somewhere.
> > > > >> > >
> > > > >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hello,
> > > > >> > > >
> > > > >> > > > I have a RecordBatch that I would like to write to a file. I'm
> > > > >> > > > using FileOutputStream::Open to open the file and
> > > > >> > > > RecordBatchStreamWriter::Open to open the stream. I write a
> > > > >> > > > record batch with WriteRecordBatch. Finally, I close the
> > > > >> > > > RecordBatchWriter and OutputStream.
> > > > >> > > >
> > > > >> > > > The resulting file size is exactly the size of the Buffer used
> > > > >> > > > to store the RecordBatch. It looks like it is missing the
> > > > >> > > > schema. When I try to open the resulting file from PyArrow I
> > > > >> > > > get:
> > > > >> > > >
> > > > >> > > > >>> pa.ipc.open_file('/tmp/1')
> > > > >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > > > >> > > >
> > > > >> > > > $ ll /tmp/1
> > > > >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > > > >> > > >
> > > > >> > > > How can I write the schema as well?
> > > > >> > > >
> > > > >> > > > I was browsing the documentation at
> > > > >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate
> > > > >> > > > any C++ documentation about RecordBatchStreamWriter or
> > > > >> > > > RecordBatchWriter. Is this intentional?
> > > > >> > > >
> > > > >> > > > Thank you!
> > > > >> > > > Rares
> > > > >> > > >
> > > > >> > >
> > > > >>
> > > > >
> > >
> >
>
