Thanks a lot, Wes! That was the issue. Good catch!

On Tue, Jun 16, 2020 at 9:39 AM Wes McKinney <wesmck...@gmail.com> wrote:

> It looks like on Python 2.7 the open_stream/open_file functions are
> treating the file name you are passing as a binary buffer rather than
> a file path (inferring from the fact that '1' is one byte in Py2.7 and
> 'foo' is 3 bytes). Try passing an open file handle instead.
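>
> For example, a minimal sketch against the '1' file from the session
> below:
>
>     import pyarrow
>
>     with open('1', 'rb') as f:
>         df = pyarrow.ipc.open_stream(f).read_pandas()
>
> (Consistent with this diagnosis, the earlier "Expected to read
> 1886221359 metadata bytes" error decodes neatly: 1886221359 is
> 0x706D742F, i.e. the first four bytes of the path "/tmp/foo" read as a
> little-endian message length.)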
>
> On Tue, Jun 16, 2020 at 11:28 AM Rares Vernica <rvern...@gmail.com> wrote:
> >
> > Thank you for your help in getting to the bottom of this. It seems that
> > there is no problem with the C++ code; the issue is with the
> > PyArrow/Python 2.7 combination.
> >
> > Here are more details. I have two C++ programs writing two Arrow files. The
> > first one is the bigger plugin I'm attempting to port, and the second one is
> > the small example listed earlier in this thread. The resulting Arrow files
> > cannot be read by PyArrow in Python 2.7, but they work fine in Python 3.8.
> > The Arrow and PyArrow versions match. I'm using 0.16.0 since there is a
> > PyArrow .whl for Python 2.7 on PyPI.
> >
> > Here is the output from Python 2.7:
> >
> > > python
> > Python 2.7.12 (default, Apr 15 2020, 17:07:12)
> > [GCC 5.4.0 20160609] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> import pyarrow
> > >>> pyarrow.__version__
> > '0.16.0'
> > >>> pyarrow.ipc.open_stream('1').read_pandas()
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
> >     return RecordBatchStreamReader(source)
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
> >     self._open(source)
> >   File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
> >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowInvalid: Corrupted message, only 1 bytes available
> > >>> pyarrow.ipc.open_stream('foo').read_pandas()
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
> >     return RecordBatchStreamReader(source)
> >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
> >     self._open(source)
> >   File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
> >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowInvalid: Corrupted message, only 3 bytes available
> >
> > And here is the output from Python 3.8:
> >
> > > python
> > Python 3.8.3 (default, May 15 2020, 00:00:00)
> > [GCC 10.1.1 20200507 (Red Hat 10.1.1-1)] on linux
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> import pyarrow
> > >>> pyarrow.__version__
> > '0.16.0'
> > >>> pyarrow.ipc.open_stream('1').read_pandas()
> >      x     y
> > 0  -10 -10.0
> > 1   -9   NaN
> > 2   -8  -8.0
> > 3   -7   NaN
> > 4   -6  -6.0
> > 5   -5   NaN
> > 6   -4  -4.0
> > 7   -3   NaN
> > 8   -2  -2.0
> > 9   -1   NaN
> > 10   0   0.0
> > 11   1   NaN
> > 12   2   2.0
> > 13   3   NaN
> > 14   4   4.0
> > 15   5   NaN
> > 16   6   6.0
> > 17   7   NaN
> > 18   8   8.0
> > 19   9   NaN
> > 20  10  10.0
> > >>> pyarrow.ipc.open_stream('foo').read_pandas()
> >     foo   bar
> > 0     0   0.0
> > 1     1   NaN
> > 2     2   2.0
> > 3     3   NaN
> > 4     4   4.0
> > 5     5   NaN
> > 6     6   6.0
> > 7     7   NaN
> > 8     8   8.0
> > 9     9   NaN
> > 10   10  10.0
> > 11   11   NaN
> > 12   12  12.0
> > 13   13   NaN
> > 14   14  14.0
> > 15   15   NaN
> > 16   16  16.0
> > 17   17   NaN
> > 18   18  18.0
> > 19   19   NaN
> > 20   20  20.0
> >
> > Is this a bug in PyArrow or some Python 2.7 package issue?
> >
> > Thanks!
> > Rares
> >
> > On Mon, Jun 15, 2020 at 10:55 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Rares,
> > > This last issue sounds like you are trying to write data with the 0.16.0
> > > version of the library and read it with a pre-0.15.0 version of the
> > > Python library. If you want to do this, you need to set "bool
> > > write_legacy_ipc_format" to true on the IpcWriterOptions/IpcOptions
> > > object and construct the StreamWriter with that object.
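> > >
> > > A rough sketch of that (untested here; in 0.17 the options struct is
> > > arrow::ipc::IpcWriteOptions, in 0.16 arrow::ipc::IpcOptions):
> > >
> > >   // Opt in to the pre-0.15 IPC framing so old readers understand it.
> > >   auto options = arrow::ipc::IpcWriteOptions::Defaults();
> > >   options.write_legacy_ipc_format = true;
> > >   ARROW_ASSIGN_OR_RAISE(
> > >       arrowWriter,
> > >       arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema, options));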
> > >
> > > -Micah
> > >
> > >
> > > On Mon, Jun 15, 2020 at 10:38 PM Rares Vernica <rvern...@gmail.com> wrote:
> > >
> > > > With open_stream I get a different error:
> > > >
> > > > > python -c "import pyarrow; pyarrow.ipc.open_stream('/tmp/foo')"
> > > > Traceback (most recent call last):
> > > >   File "<string>", line 1, in <module>
> > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
> > > >     return RecordBatchStreamReader(source)
> > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
> > > >     self._open(source)
> > > >   File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
> > > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > > pyarrow.lib.ArrowInvalid: Expected to read 1886221359 metadata bytes, but only read 4
> > > >
> > > >
> > > > On Mon, Jun 15, 2020 at 10:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > > On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <rvern...@gmail.com> wrote:
> > > > > >
> > > > > > I was able to reproduce my issue in a small, fully contained
> > > > > > program. Here is the source code:
> > > > > >
> > > > > > #include <arrow/builder.h>
> > > > > > #include <arrow/io/file.h>
> > > > > > #include <arrow/ipc/writer.h>
> > > > > > #include <arrow/record_batch.h>
> > > > > >
> > > > > > arrow::Status foo() {
> > > > > >   std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > > > >   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > > > >   std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > > > >   std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > > > >
> > > > > >   std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
> > > > > >   arrowFields[0] = arrow::field("foo", arrow::int64());
> > > > > >   arrowFields[1] = arrow::field("bar", arrow::int64());
> > > > > >   std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);
> > > > > >
> > > > > >   std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
> > > > > >   arrow::Int64Builder arrowBuilder;
> > > > > >   for (int i = 0; i < 2; i++) {
> > > > > >     // Every odd row of the second column ("bar") is null.
> > > > > >     for (int j = 0; j < 21; j++) {
> > > > > >       if (i && (j % 2)) {
> > > > > >         ARROW_RETURN_NOT_OK(arrowBuilder.AppendNull());
> > > > > >       } else {
> > > > > >         ARROW_RETURN_NOT_OK(arrowBuilder.Append(j));
> > > > > >       }
> > > > > >     }
> > > > > >     ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
> > > > > >   }
> > > > > >   arrowBatch = arrow::RecordBatch::Make(arrowSchema,
> > > > > >                                         arrowArrays[0]->length(),
> > > > > >                                         arrowArrays);
> > > > > >
> > > > > >   ARROW_ASSIGN_OR_RAISE(arrowStream,
> > > > > >                         arrow::io::FileOutputStream::Open("/tmp/foo"));
> > > > > >   ARROW_ASSIGN_OR_RAISE(arrowWriter,
> > > > > >                         arrow::ipc::NewStreamWriter(arrowStream.get(),
> > > > > >                                                     arrowSchema));
> > > > > >   ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > > > > >   ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > > > >   ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > > > >
> > > > > >   return arrow::Status::OK();
> > > > > > }
> > > > > >
> > > > > > int main() {
> > > > > >   // Surface any Arrow error as a nonzero exit code.
> > > > > >   arrow::Status status = foo();
> > > > > >   return status.ok() ? 0 : 1;
> > > > > > }
> > > > > >
> > > > > > I compile and run it like this:
> > > > > >
> > > > > > > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> > > > > > -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
> > > > > >
> > > > > > The file is small and I can't read it from PyArrow:
> > > > > >
> > > > > > > python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
> > > > >
> > > > > Here is your problem. Try `pyarrow.ipc.open_stream`.
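> > > > >
> > > > > The program writes with NewStreamWriter, i.e. the stream format; the
> > > > > file format adds a magic number and footer that open_file looks for.
> > > > > A matching read would be, roughly:
> > > > >
> > > > >     pyarrow.ipc.open_stream('/tmp/foo').read_pandas()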
> > > > >
> > > > > > Traceback (most recent call last):
> > > > > >   File "<string>", line 1, in <module>
> > > > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
> > > > > >     return RecordBatchFileReader(source, footer_offset=footer_offset)
> > > > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
> > > > > >     self._open(source, footer_offset=footer_offset)
> > > > > >   File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
> > > > > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > > > > pyarrow.lib.ArrowInvalid: File is too small: 8
> > > > > >
> > > > > > Here are the Arrow and g++ versions:
> > > > > >
> > > > > > > dpkg -s libarrow-dev
> > > > > > Package: libarrow-dev
> > > > > > Status: install ok installed
> > > > > > Priority: optional
> > > > > > Section: libdevel
> > > > > > Installed-Size: 38738
> > > > > > Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
> > > > > > Architecture: amd64
> > > > > > Multi-Arch: same
> > > > > > Source: apache-arrow
> > > > > > Version: 0.17.1-1
> > > > > > Depends: libarrow17 (= 0.17.1-1)
> > > > > >
> > > > > > > g++ --version
> > > > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > > > >
> > > > > > Does this make sense?
> > > > > >
> > > > > > Cheers,
> > > > > > Rares
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com> wrote:
> > > > > >
> > > > > > > This is the compiler:
> > > > > > >
> > > > > > > > g++ --version
> > > > > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > > > > >
> > > > > > > And this is how I compile the code:
> > > > > > >
> > > > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > > > > > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > > > > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > > > > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > > > > > -c PhysicalAioSave.cpp -o PhysicalAioSave.o
> > > > > > >
> > > > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > > > > > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > > > > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > > > > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > > > > > > -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> > > > > > > LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > > > > > > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > > > > > > -Wl,-soname,libaccelerated_io_tools.so -L.
> > > > > > > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > > > > > > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> > > > > > >
> > > > > > > We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI
> > > > > > > still has PyArrow binaries for 2.7.
> > > > > > >
> > > > > > > Anyway, I temporarily upgraded to 0.17.1, but the result is the same. I
> > > > > > > also fixed all the deprecation warnings, but that did not help either.
> > > > > > >
> > > > > > > Setting a breakpoint might be a challenge since this code runs as a
> > > > > > > plug-in, but I'll try to isolate this further.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Rares
> > > > > > >
> > > > > > > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > >> What compiler are you using?
> > > > > > >>
> > > > > > >> In 0.16.0 (what you said you were targeting, though it would be
> > > > > > >> better for you to upgrade to 0.17.1) the schema is written in the
> > > > > > >> CheckStarted function here:
> > > > > > >>
> > > > > > >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> > > > > > >>
> > > > > > >> Status CheckStarted() {
> > > > > > >>   if (!started_) {
> > > > > > >>     return Start();
> > > > > > >>   }
> > > > > > >>   return Status::OK();
> > > > > > >> }
> > > > > > >>
> > > > > > >> started_ is set to false by a default member initializer in the
> > > > > > >> protected block. Maybe you should set a breakpoint in this function
> > > > > > >> and see if for some reason started_ is true on the first invocation
> > > > > > >> (in which case it makes me wonder if there is something
> > > > > > >> not-fully-C++11-compliant about your toolchain).
> > > > > > >>
> > > > > > >> Otherwise I'm a bit stumped since there are lots of production
> > > > > > >> applications that use this code.
> > > > > > >>
> > > > > > >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > Sure, here is briefly what I'm doing:
> > > > > > >> >
> > > > > > >> >     bool append = false;
> > > > > > >> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > > > > >> >     auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
> > > > > > >> >     arrowStream = arrowResult.ValueOrDie();
> > > > > > >> >
> > > > > > >> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > > > > >> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > > > > >> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > > > > >> >
> > > > > > >> >     std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
> > > > > > >> >             inputSchema, settings.isAttsOnly());
> > > > > > >> >     ARROW_RETURN_NOT_OK(
> > > > > > >> >             arrow::ipc::RecordBatchStreamWriter::Open(
> > > > > > >> >                 arrowStream.get(), arrowSchema, &arrowWriter));
> > > > > > >> >
> > > > > > >> >     // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> > > > > > >> >     ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > > > > > >> >     ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > > > > > >> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > > > > >> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > > > > >> >
> > > > > > >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > > Can you show the code you are writing? The first thing the stream
> > > > > > >> > > writer does before writing any record batch is write the schema. It
> > > > > > >> > > sounds like you are using arrow::ipc::WriteRecordBatch somewhere.
> > > > > > >> > >
> > > > > > >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com> wrote:
> > > > > > >> > >
> > > > > > >> > > > Hello,
> > > > > > >> > > >
> > > > > > >> > > > I have a RecordBatch that I would like to write to a file. I'm
> > > > > > >> > > > using FileOutputStream::Open to open the file and
> > > > > > >> > > > RecordBatchStreamWriter::Open to open the stream. I write a record
> > > > > > >> > > > batch with WriteRecordBatch. Finally, I close the RecordBatchWriter
> > > > > > >> > > > and OutputStream.
> > > > > > >> > > >
> > > > > > >> > > > The resulting file size is exactly the size of the Buffer used to
> > > > > > >> > > > store the RecordBatch. It looks like it is missing the schema. When
> > > > > > >> > > > I try to open the resulting file from PyArrow I get:
> > > > > > >> > > >
> > > > > > >> > > > >>> pa.ipc.open_file('/tmp/1')
> > > > > > >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > > > > > >> > > >
> > > > > > >> > > > $ ll /tmp/1
> > > > > > >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > > > > > >> > > >
> > > > > > >> > > > How can I write the schema as well?
> > > > > > >> > > >
> > > > > > >> > > > I was browsing the documentation at
> > > > > > >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any
> > > > > > >> > > > C++ documentation about RecordBatchStreamWriter or RecordBatchWriter.
> > > > > > >> > > > Is this intentional?
> > > > > > >> > > >
> > > > > > >> > > > Thank you!
> > > > > > >> > > > Rares
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >>
> > > > > > >
> > > > >
> > > >
> > >
>
