Rares Vernica created ARROW-1676:
------------------------------------
Summary: [C++] Feather inserts 0 at the beginning and trims one
value at the end
Key: ARROW-1676
URL: https://issues.apache.org/jira/browse/ARROW-1676
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.7.1
Environment: libarrow-dev
Architecture: amd64
Version: 0.7.1-1
Python 2.7.13
>>> pyarrow.__version__
'0.7.1'
>>> feather.__version__
'0.4.0'
>>> pandas.__version__
u'0.20.3'
Reporter: Rares Vernica
An extra {{0}} appears at the beginning when an array with more than {{128}}
values and at least one {{NULL}} value is serialized and deserialized using
{{Feather}}. The extra {{0}} shifts the following values by one position, so
the last value is trimmed.
Here is the C++ code to write such an array:
{code:java}
#include <iostream>
#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/feather.h>
#include <arrow/pretty_print.h>

// Note: arrow::Status return values are ignored here for brevity.
int main() {
  // 1. Build Array: null at index 0, then the values 1..128
  arrow::DoubleBuilder builder;
  for (int i = 0; i < 129; i++) {
    if (i == 0) {
      builder.AppendNull();
    } else {
      builder.Append(i);
    }
  }
  std::shared_ptr<arrow::Array> array;
  builder.Finish(&array);
  arrow::PrettyPrint(*array, 0, &std::cout);
  std::cout << std::endl;

  // 2. Write to Feather file
  std::shared_ptr<arrow::io::FileOutputStream> stream;
  arrow::io::FileOutputStream::Open("out.f", false, &stream);
  std::unique_ptr<arrow::ipc::feather::TableWriter> writer;
  arrow::ipc::feather::TableWriter::Open(stream, &writer);
  writer->SetNumRows(129);
  writer->Append("id", *array);
  writer->Finalize();
  stream->Close();
  return 0;
}
{code}
The output of running this code is:
{code:java}
# g++-4.9 -std=c++11 example.cpp -larrow && ./a.out
[null, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]
{code}
The array is deserialized in Python and looks like this:
{code:java}
>>> pandas.read_feather('out.f')
id
0 NaN
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
6 5.0
7 6.0
8 7.0
9 8.0
10 9.0
11 10.0
12 11.0
13 12.0
14 13.0
15 14.0
16 15.0
17 16.0
18 17.0
19 18.0
20 19.0
21 20.0
22 21.0
23 22.0
24 23.0
25 24.0
26 25.0
27 26.0
28 27.0
29 28.0
.. ...
99 98.0
100 99.0
101 100.0
102 101.0
103 102.0
104 103.0
105 104.0
106 105.0
107 106.0
108 107.0
109 108.0
110 109.0
111 110.0
112 111.0
113 112.0
114 113.0
115 114.0
116 115.0
117 116.0
118 117.0
119 118.0
120 119.0
121 120.0
122 121.0
123 122.0
124 123.0
125 124.0
126 125.0
127 126.0
128 127.0
[129 rows x 1 columns]
{code}
Notice the {{0.0}} value at index {{1}}; it should have been {{1.0}}.
Also, the last value is {{127.0}} instead of {{128.0}}.
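
To make the mismatch easier to quantify, here is a minimal Python sketch that compares the values the C++ program writes with the values shown in the pandas output. The helper names ({{expected_column}}, {{observed_column}}, {{mismatches}}) are hypothetical and not part of pyarrow or feather. The observed column is rebuilt from the printed rows, assuming the elided rows (30 to 98) follow the same off-by-one pattern.

```python
import math


def expected_column(n=129):
    """Values the C++ example writes: a null at index 0, then 1.0 .. n-1."""
    return [None] + [float(i) for i in range(1, n)]


def observed_column(n=129):
    """Values as printed by pandas.read_feather in this report.

    Assumption: the rows elided in the printout continue the same pattern
    (NaN, 0.0, 1.0, ..., n-2).
    """
    return [math.nan] + [float(i) for i in range(0, n - 1)]


def is_null(value):
    """Treat both None and NaN as a null entry."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mismatches(expected, observed):
    """Return (index, expected, observed) for every position that differs."""
    if len(expected) != len(observed):
        raise ValueError("columns have different lengths")
    diffs = []
    for i, (e, o) in enumerate(zip(expected, observed)):
        if is_null(e) and is_null(o):
            continue  # null round-tripped correctly
        if is_null(e) or is_null(o) or e != o:
            diffs.append((i, e, o))
    return diffs


if __name__ == "__main__":
    bad = mismatches(expected_column(), observed_column())
    print(len(bad), "mismatched rows; first:", bad[0], "last:", bad[-1])
```

With the actual file, the observed values could instead be taken from {{pandas.read_feather('out.f')['id'].tolist()}} and passed to {{mismatches}} in place of {{observed_column()}}.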