On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <rvern...@gmail.com> wrote:
>
> I was able to reproduce my issue in a small, fully-contained program. Here
> is the source code:
>
> #include <arrow/builder.h>
> #include <arrow/io/file.h>
> #include <arrow/ipc/writer.h>
> #include <arrow/record_batch.h>
>
> arrow::Status foo() {
>   std::shared_ptr<arrow::io::OutputStream> arrowStream;
>   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
>   std::shared_ptr<arrow::RecordBatch> arrowBatch;
>   std::shared_ptr<arrow::RecordBatchReader> arrowReader;
>
>   std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
>   arrowFields[0] = arrow::field("foo", arrow::int64());
>   arrowFields[1] = arrow::field("bar", arrow::int64());
>   std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);
>
>   std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
>   arrow::Int64Builder arrowBuilder;
>   for (int i = 0; i < 2; i++) {
>     for (int j = 0; j < 21; j++)
>       if (i && (j % 2))
>         arrowBuilder.AppendNull();
>       else
>         arrowBuilder.Append(j);
>     ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
>   }
>   arrowBatch = arrow::RecordBatch::Make(arrowSchema,
>       arrowArrays[0]->length(), arrowArrays);
>
>   ARROW_ASSIGN_OR_RAISE(arrowStream,
>       arrow::io::FileOutputStream::Open("/tmp/foo"));
>   ARROW_ASSIGN_OR_RAISE(arrowWriter,
>       arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
>   ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
>   ARROW_RETURN_NOT_OK(arrowWriter->Close());
>   ARROW_RETURN_NOT_OK(arrowStream->Close());
>
>   return arrow::Status::OK();
> }
>
> int main() {
>   foo();
> }
>
> I compile and run it like this:
>
> > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
>
> The file is small and I can't read it from PyArrow:
>
> > python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"

Here is your problem: `NewStreamWriter` produces the IPC stream format, which
`open_file` cannot read. Try `pyarrow.ipc.open_stream` instead.

> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
>     return RecordBatchFileReader(source, footer_offset=footer_offset)
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
>     self._open(source, footer_offset=footer_offset)
>   File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: File is too small: 8
>
> Here is the Arrow and G++ version:
>
> > dpkg -s libarrow-dev
> Package: libarrow-dev
> Status: install ok installed
> Priority: optional
> Section: libdevel
> Installed-Size: 38738
> Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
> Architecture: amd64
> Multi-Arch: same
> Source: apache-arrow
> Version: 0.17.1-1
> Depends: libarrow17 (= 0.17.1-1)
>
> > g++ --version
> g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
>
> Does this make sense?
>
> Cheers,
> Rares
>
>
> On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com> wrote:
>
> > This is the compiler:
> >
> > > g++ --version
> > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> >
> > And this is how I compile the code:
> >
> > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > -c PhysicalAioSave.cpp -o PhysicalAioSave.o
> >
> > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D_STDC_FORMAT_MACROS
> > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include"
> > -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> > LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > -Wl,-soname,libaccelerated_io_tools.so -L.
> > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> >
> > We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI still
> > has PyArrow binaries for 2.7.
> >
> > Anyway, I temporarily upgraded to 0.17.1 but the result is the same. I
> > also fixed all the deprecation warnings but that did not help either.
> >
> > Setting a breakpoint might be a challenge since this code runs as a
> > plug-in, but I'll try to isolate this further.
> >
> > Thanks!
> > Rares
> >
> >
> > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> What compiler are you using?
> >>
> >> In 0.16.0 (what you said you were targeting, though it would be better
> >> for you to upgrade to 0.17.1) schema is written in the CheckStarted
> >> function here
> >>
> >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> >>
> >> Status CheckStarted() {
> >>   if (!started_) {
> >>     return Start();
> >>   }
> >>   return Status::OK();
> >> }
> >>
> >> started_ is set to false by a default member initializer in the
> >> protected block. Maybe you should set a breakpoint in this function
> >> and see if for some reason started_ is true on the first invocation
> >> (in which case it makes me wonder if there is something
> >> not-fully-C++11-compliant about your toolchain).
> >>
> >> Otherwise I'm a bit stumped since there are lots of production
> >> applications that use this code.
> >>
> >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com> wrote:
> >> >
> >> > Sure, here is briefly what I'm doing:
> >> >
> >> > bool append = false;
> >> > std::shared_ptr<arrow::io::OutputStream> arrowStream;
> >> > auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
> >> > arrowStream = arrowResult.ValueOrDie();
> >> >
> >> > std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> >> > std::shared_ptr<arrow::RecordBatch> arrowBatch;
> >> > std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> >> >
> >> > std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
> >> >     inputSchema, settings.isAttsOnly());
> >> > ARROW_RETURN_NOT_OK(
> >> >     arrow::ipc::RecordBatchStreamWriter::Open(
> >> >         arrowStream.get(), arrowSchema, &arrowWriter));
> >> >
> >> > // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> >> > ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> >> > ARROW_RETURN_NOT_OK(
> >> >     arrowWriter->WriteRecordBatch(*arrowBatch));
> >> > ARROW_RETURN_NOT_OK(arrowWriter->Close());
> >> > ARROW_RETURN_NOT_OK(arrowStream->Close());
> >> >
> >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> > > Can you show the code you are writing? The first thing the stream writer
> >> > > does before writing any record batch is write the schema. It sounds like
> >> > > you are using arrow::ipc::WriteRecordBatch somewhere.
> >> > >
> >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com> wrote:
> >> > >
> >> > > > Hello,
> >> > > >
> >> > > > I have a RecordBatch that I would like to write to a file. I'm using
> >> > > > FileOutputStream::Open to open the file and RecordBatchStreamWriter::Open
> >> > > > to open the stream. I write a record batch with WriteRecordBatch. Finally,
> >> > > > I close the RecordBatchWriter and OutputStream.
> >> > > >
> >> > > > The resulting file size is exactly the size of the Buffer used to store the
> >> > > > RecordBatch. It looks like it is missing the schema. When I try to open the
> >> > > > resulting file from PyArrow I get:
> >> > > >
> >> > > > >>> pa.ipc.open_file('/tmp/1')
> >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> >> > > >
> >> > > > $ ll /tmp/1
> >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> >> > > >
> >> > > > How can I write the schema as well?
> >> > > >
> >> > > > I was browsing the documentation at
> >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any C++
> >> > > > documentation about RecordBatchStreamWriter or RecordBatchWriter. Is this
> >> > > > intentional?
> >> > > >
> >> > > > Thank you!
> >> > > > Rares
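
A footnote on the `pyarrow.ipc.open_stream` suggestion at the top of the thread: the Arrow IPC file format and stream format differ on disk, so it can help to check which one a file holds before choosing a reader. The sketch below is only an illustration and not part of pyarrow. The helper name `sniff_arrow_ipc_format` is made up for this note. It assumes the layout described in the Arrow columnar format documentation: an IPC file starts and ends with the magic string "ARROW1", while a stream written by Arrow 0.15 or later starts with the 0xFFFFFFFF continuation marker. Streams in the older pre-0.15 framing would come back as "unknown".

```python
import os

ARROW_FILE_MAGIC = b"ARROW1"                # first and last bytes of an IPC *file*
STREAM_CONTINUATION = b"\xff\xff\xff\xff"   # first bytes of an IPC *stream* message (Arrow >= 0.15)


def sniff_arrow_ipc_format(path):
    """Guess whether `path` holds the Arrow IPC file or stream format.

    Hypothetical helper for illustration only; it looks at the leading and
    trailing bytes and does not validate the rest of the content.
    """
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 2 * len(ARROW_FILE_MAGIC):
            # Read the trailing magic only if the file can hold both copies.
            f.seek(-len(ARROW_FILE_MAGIC), os.SEEK_END)
            tail = f.read()

    if head.startswith(ARROW_FILE_MAGIC) and tail == ARROW_FILE_MAGIC:
        return "file"    # candidate for pyarrow.ipc.open_file
    if head.startswith(STREAM_CONTINUATION):
        return "stream"  # candidate for pyarrow.ipc.open_stream
    return "unknown"
```

If the layout assumption holds, a file produced by the stream writer (as in the reproducer, /tmp/foo) should be reported as "stream", and it could then be read with something like `pyarrow.ipc.open_stream('/tmp/foo').read_all()`.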