Thanks Ryan,

After further pondering this, I came to similar conclusions.

Compress the data before putting it into a Parquet ByteArray, and if that's not 
feasible, reference it in an external/persisted data structure.
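
Roughly, as an untested sketch of the compress-first route (zstd here is only 
an example codec, and the helper itself is made up):

#include <cstdint>
#include <stdexcept>
#include <vector>

#include <zstd.h>
#include <parquet/types.h>

// Compress the value up front so the stored ByteArray stays under the
// uint32_t length limit -- assuming the data actually compresses.
parquet::ByteArray CompressToByteArray(const uint8_t* src, size_t src_len,
                                       std::vector<uint8_t>* scratch) {
  scratch->resize(ZSTD_compressBound(src_len));
  size_t n = ZSTD_compress(scratch->data(), scratch->size(), src, src_len, /*level=*/3);
  if (ZSTD_isError(n) || n > UINT32_MAX) {
    throw std::runtime_error("value still too large for a ByteArray after compression");
  }
  scratch->resize(n);
  return parquet::ByteArray(static_cast<uint32_t>(n), scratch->data());
}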

Another alternative is to create one or more “shadow columns” to store the 
overflow horizontally.
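
The writer-side slicing for the shadow-column route might look roughly like 
this (sketch only; the column naming and reader-side reassembly convention are 
assumptions):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Slice one oversized value into a primary column plus N "shadow" columns,
// each slice small enough for a uint32_t ByteArray length. A reader would
// concatenate col, col_shadow1, col_shadow2, ... to reassemble the value.
std::vector<std::pair<const uint8_t*, uint32_t>> SliceForShadowColumns(
    const uint8_t* data, uint64_t total_len, uint32_t max_slice) {
  std::vector<std::pair<const uint8_t*, uint32_t>> slices;
  for (uint64_t off = 0; off < total_len; off += max_slice) {
    uint64_t n = std::min<uint64_t>(max_slice, total_len - off);
    slices.emplace_back(data + off, static_cast<uint32_t>(n));
  }
  return slices;
}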

-Brian

On Apr 5, 2019, at 3:11 PM, Ryan Blue <rb...@netflix.com> wrote:



I don't think that's what you would want to do. Parquet will eventually 
compress large values, but only after making defensive copies and attempting to 
encode them. In the end, it will be a lot more overhead, plus the work to make 
it possible. I think you'd be much better off compressing before storing in 
Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <brian.bow...@sas.com> wrote:
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBs are doable, but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:



Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.
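
For reference, the PLAIN layout that bakes in the limit is roughly this 
(a simplified sketch, not the actual writer code):

#include <cstdint>
#include <cstring>
#include <vector>

// PLAIN-encoded BYTE_ARRAY: a 4-byte little-endian length prefix followed by
// the raw bytes, which is where the 2^32 ceiling comes from.
void AppendPlainByteArray(std::vector<uint8_t>* page, const uint8_t* value, uint32_t len) {
  uint8_t prefix[4];
  std::memcpy(prefix, &len, sizeof(len));  // assumes a little-endian host
  page->insert(page->end(), prefix, prefix + 4);
  page->insert(page->end(), value, value + len);
}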

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where they would destroy any benefit from 
using the format.
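
Sketching that idea (not an existing Parquet feature; the names are 
placeholders):

#include <cstdint>
#include <string>

// Hypothetical external-reference record: the Parquet file stores only this
// locator; the gigantic value lives in a side file it points to.
struct BlobRef {
  std::string path;     // side file holding the raw bytes
  uint64_t offset = 0;  // where the value starts in that file
  uint64_t length = 0;  // full 64-bit length, no ByteArray limit involved
};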

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;  // 32-bit length field caps a single value at ~4 GiB
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:


    Hi Brian,

    This seems like something we should allow. What imposes the current limit?
    Is it in the thrift format, or just the implementations?

    On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:

    > All,
    >
    > SAS requires support for storing varying-length character and binary blobs
    > with a 2^64 max length in Parquet.  Currently, the ByteArray len field is
    > a uint32_t.  Looks like this will require incrementing the Parquet file
    > format version and changing ByteArray len to uint64_t.
    >
    > Have there been any requests for this or other Parquet developments that
    > require file format versioning changes?
    >
    > I realize this is a non-trivial ask.  Thanks for considering it.
    >
    > -Brian
    >


    --
    Ryan Blue
    Software Engineer
    Netflix




--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
