I was able to reproduce my issue in a small, fully contained program. Here
is the source code:

#include <arrow/builder.h>
#include <arrow/io/file.h>
#include <arrow/ipc/writer.h>
#include <arrow/record_batch.h>

#include <iostream>

arrow::Status foo() {
  std::shared_ptr<arrow::io::OutputStream> arrowStream;
  std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
  std::shared_ptr<arrow::RecordBatch> arrowBatch;
  std::shared_ptr<arrow::RecordBatchReader> arrowReader;

  std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
  arrowFields[0] = arrow::field("foo", arrow::int64());
  arrowFields[1] = arrow::field("bar", arrow::int64());
  std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);

  // Build two int64 arrays of 21 values; the second has nulls at odd indices.
  std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
  arrow::Int64Builder arrowBuilder;
  for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 21; j++) {
      if (i && (j % 2))
        ARROW_RETURN_NOT_OK(arrowBuilder.AppendNull());
      else
        ARROW_RETURN_NOT_OK(arrowBuilder.Append(j));
    }
    ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
  }
  arrowBatch = arrow::RecordBatch::Make(
      arrowSchema, arrowArrays[0]->length(), arrowArrays);

  ARROW_ASSIGN_OR_RAISE(
      arrowStream, arrow::io::FileOutputStream::Open("/tmp/foo"));
  ARROW_ASSIGN_OR_RAISE(
      arrowWriter, arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
  ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
  ARROW_RETURN_NOT_OK(arrowWriter->Close());
  ARROW_RETURN_NOT_OK(arrowStream->Close());

  return arrow::Status::OK();
}

int main() {
  // Report any error from foo() instead of silently dropping the status.
  arrow::Status status = foo();
  if (!status.ok()) std::cerr << status.ToString() << std::endl;
  return status.ok() ? 0 : 1;
}

I compile and run it like this:

> g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
-rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo

The resulting file is small and I can't read it with PyArrow:

> python -c "import pyarrow;
pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156,
in open_file
    return RecordBatchFileReader(source, footer_offset=footer_offset)
  File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in
__init__
    self._open(source, footer_offset=footer_offset)
  File "pyarrow/ipc.pxi", line 398, in
pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: File is too small: 8
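
For reference, here is a rough read-back sketch (my addition for checking, not
part of the failing program) that opens the same file with the C++ stream
reader, since the file was written with the stream writer. I'm assuming the
Result-returning reader Open overloads from 0.17, so treat it as a sketch
rather than tested code:

#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>

#include <iostream>

// Hypothetical check: read /tmp/foo back with the stream reader (matching the
// stream writer used above) and print what it finds.
arrow::Status readBack() {
  std::shared_ptr<arrow::io::ReadableFile> arrowFile;
  ARROW_ASSIGN_OR_RAISE(arrowFile, arrow::io::ReadableFile::Open("/tmp/foo"));

  std::shared_ptr<arrow::RecordBatchReader> arrowReader;
  ARROW_ASSIGN_OR_RAISE(
      arrowReader, arrow::ipc::RecordBatchStreamReader::Open(arrowFile));

  // The stream reader decodes the schema message before any record batches.
  std::cout << arrowReader->schema()->ToString() << std::endl;

  std::shared_ptr<arrow::RecordBatch> arrowBatch;
  ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
  std::cout << "rows read back: " << arrowBatch->num_rows() << std::endl;

  return arrow::Status::OK();
}

If the stream reader can print the schema here, the schema message is present
in the file even though pyarrow.ipc.open_file rejects it.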

Here are the Arrow and g++ versions:

> dpkg -s libarrow-dev
Package: libarrow-dev
Status: install ok installed
Priority: optional
Section: libdevel
Installed-Size: 38738
Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
Architecture: amd64
Multi-Arch: same
Source: apache-arrow
Version: 0.17.1-1
Depends: libarrow17 (= 0.17.1-1)

> g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

Does this make sense?

Cheers,
Rares


On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com> wrote:

> This is the compiler:
>
> > g++ --version
> g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
>
> And this is how I compile the code:
>
> g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> -c PhysicalAioSave.cpp -o PhysicalAioSave.o
>
> g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> -Wl,-soname,libaccelerated_io_tools.so -L.
> -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
>
> We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI still
> has PyArrow binaries for 2.7.
>
> Anyway, I temporarily upgraded to 0.17.1 but the result is the same. I
> also fixed all the deprecation warnings but that did not help either.
>
> Setting a breakpoint might be a challenge since this code runs as a
> plug-in, but I'll try to isolate this further.
>
> Thanks!
> Rares
>
>
>
>
> On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> What compiler are you using?
>>
>> In 0.16.0 (what you said you were targeting, though it would be better
>> for you to upgrade to 0.17.1) the schema is written in the CheckStarted
>> function here:
>>
>>
>> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
>>
>> Status CheckStarted() {
>>   if (!started_) {
>>     return Start();
>>   }
>>   return Status::OK();
>> }
>>
>> started_ is set to false by a default member initializer in the
>> protected block. Maybe you should set a breakpoint in this function
>> and see if for some reason started_ is true on the first invocation
>> (in which case it makes me wonder if there is something
>> not-fully-C++11-compliant about your toolchain).
>>
>> Otherwise I'm a bit stumped since there are lots of production
>> applications that use this code.
>>
>> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com>
>> wrote:
>> >
>> > Sure, here is briefly what I'm doing:
>> >
>> >     bool append = false;
>> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
>> >     auto arrowResult = arrow::io::FileOutputStream::Open(fileName,
>> append);
>> >     arrowStream = arrowResult.ValueOrDie();
>> >
>> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
>> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
>> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
>> >
>> >     std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
>> >             inputSchema, settings.isAttsOnly());
>> >     ARROW_RETURN_NOT_OK(
>> >             arrow::ipc::RecordBatchStreamWriter::Open(
>> >                 arrowStream.get(), arrowSchema, &arrowWriter));
>> >
>> >     // Setup "arrowReader" using BufferReader and
>> RecordBatchStreamReader
>> >     ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
>> >     ARROW_RETURN_NOT_OK(
>> >                 arrowWriter->WriteRecordBatch(*arrowBatch));
>> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
>> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
>> >
>> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >
>> > > Can you show the code you are writing? The first thing the stream
>> writer
>> > > does before writing any record batch is write the schema. It sounds
>> like
>> > > you are using arrow::ipc::WriteRecordBatch somewhere.
>> > >
>> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com>
>> wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I have a RecordBatch that I would like to write to a file. I'm using
>> > > > FileOutputStream::Open to open the file and
>> RecordBatchStreamWriter::Open
>> > > > to open the stream. I write a record batch with WriteRecordBatch.
>> > > Finally,
>> > > > I close the RecordBatchWriter and OutputStream.
>> > > >
>> > > > The resulting file size is exactly the size of the Buffer used to
>> store
>> > > the
>> > > > RecordBatch. It looks like it is missing the schema. When I try to
>> open
>> > > the
>> > > > resulting file from PyArrow I get:
>> > > >
>> > > > >>> pa.ipc.open_file('/tmp/1')
>> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
>> > > >
>> > > > $ ll /tmp/1
>> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
>> > > >
>> > > > How can I write the schema as well?
>> > > >
>> > > > I was browsing the documentation at
>> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate
>> any C++
>> > > > documentation about RecordBatchStreamWriter or RecordBatchWriter.
>> Is this
>> > > > intentional?
>> > > >
>> > > > Thank you!
>> > > > Rares
>> > > >
>> > >
>>
>
