Hi Brian,

Just to comment from the C++ side -- the 64-bit issue is a limitation
of the Parquet format itself, not of the C++ implementation. It could
be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset
encoding (we are discussing much the same for the in-memory format in
Apache Arrow).
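
For illustration only, here is a minimal sketch of what such a type
might look like on the C++ side; the name LargeByteArray is
hypothetical and simply mirrors the existing parquet::ByteArray struct
with a 64-bit length:

  #include <cstdint>

  // Hypothetical 64-bit counterpart to parquet::ByteArray; not part of
  // any current format or API.
  struct LargeByteArray {
    LargeByteArray() : len(0), ptr(nullptr) {}
    LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
    uint64_t len;          // 64-bit length instead of uint32_t
    const uint8_t* ptr;
  };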

- Wes

On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> I don't think that's what you would want to do. Parquet will eventually
> compress large values, but only after making defensive copies and
> attempting to encode them. In the end, it will be a lot more overhead,
> plus the work to make it possible. I think you'd be much better off
> compressing before storing in Parquet if you expect good compression
> rates.
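>
> A minimal sketch of that compress-before-store approach, assuming zlib
> is available; the CompressBlob helper name is made up for illustration:
>
>   #include <zlib.h>
>   #include <cstddef>
>   #include <cstdint>
>   #include <stdexcept>
>   #include <vector>
>
>   // Compress a large blob up front, then hand the (much smaller) result
>   // to Parquet as an ordinary ByteArray value.
>   std::vector<uint8_t> CompressBlob(const uint8_t* data, size_t size) {
>     uLongf out_len = compressBound(static_cast<uLong>(size));
>     std::vector<uint8_t> out(out_len);
>     if (compress2(out.data(), &out_len, data, static_cast<uLong>(size),
>                   Z_DEFAULT_COMPRESSION) != Z_OK) {
>       throw std::runtime_error("zlib compression failed");
>     }
>     out.resize(out_len);  // shrink to the actual compressed size
>     return out;
>   }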
>
> On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <brian.bow...@sas.com> wrote:
>
> > My hope is that these large ByteArray values will encode/compress to a
> > fraction of their original size.  FWIW, cpp/src/parquet/column_writer.cc/.h
> > has int64_t offset and length fields all over the place.
> >
> > External file references to BLOBs are doable but not the elegant,
> > integrated solution I was hoping for.
> >
> > -Brian
> >
> > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
> >
> > Looks like we will need a new encoding for this:
> > https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > That doc specifies that the plain encoding uses a 4-byte length. That's
> > not going to be a quick fix.
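> >
> > For reference, PLAIN encodes each BYTE_ARRAY value as a 4-byte
> > little-endian length prefix followed by the raw bytes, which is where
> > the 2^32 - 1 ceiling comes from. A tiny illustrative helper (assuming a
> > little-endian host):
> >
> >   #include <cstdint>
> >   #include <vector>
> >
> >   // Append one PLAIN-encoded BYTE_ARRAY value: length prefix, then bytes.
> >   void AppendPlainByteArray(std::vector<uint8_t>* sink,
> >                             const uint8_t* data, uint32_t len) {
> >     const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
> >     sink->insert(sink->end(), p, p + sizeof(len));   // 4-byte length
> >     sink->insert(sink->end(), data, data + len);     // value bytes
> >   }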
> >
> > Now that I'm thinking about this a bit more, does it make sense to support
> > byte arrays that are more than 2GB? That's far larger than the size of a
> > row group, let alone a page. This would completely break memory management
> > in the JVM implementation.
> >
> > Can you solve this problem using a BLOB type that references an external
> > file with the gigantic values? Seems to me that values this large should go
> > in separate files, not in a Parquet file where they would destroy any benefit
> > from using the format.
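> >
> > One way to picture that approach: keep the blob bytes in an external
> > file and store only a small reference record in Parquet. The BlobRef
> > layout below is purely illustrative; no such type exists in the format
> > today:
> >
> >   #include <cstdint>
> >   #include <string>
> >
> >   // One row per blob; each field maps to an ordinary Parquet column
> >   // (BYTE_ARRAY for the path, INT64 for offset and length).
> >   struct BlobRef {
> >     std::string path;   // external file holding the blob bytes
> >     int64_t offset;     // starting byte offset within that file
> >     int64_t length;     // blob length, free of the uint32_t limit
> >   };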
> >
> > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:
> >
> >> Hello Ryan,
> >>
> >> Looks like it's limited by both the Parquet implementation and the Thrift
> >> message methods.  Am I missing anything?
> >>
> >> From cpp/src/parquet/types.h
> >>
> >> struct ByteArray {
> >>   ByteArray() : len(0), ptr(NULLPTR) {}
> >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> >>   uint32_t len;
> >>   const uint8_t* ptr;
> >> };
> >>
> >> From cpp/src/parquet/thrift.h
> >>
> >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> >> deserialized_msg) {
> >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
> >> out)
> >>
> >> -Brian
> >>
> >> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
> >>
> >>     Hi Brian,
> >>
> >>     This seems like something we should allow. What imposes the current
> >> limit?
> >>     Is it in the thrift format, or just the implementations?
> >>
> >>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com>
> >> wrote:
> >>
> >>     > All,
> >>     >
> >>     > SAS requires support for storing varying-length character and
> >>     > binary blobs with a 2^64 max length in Parquet.  Currently, the
> >>     > ByteArray len field is a uint32_t.  Looks like this will require
> >>     > incrementing the Parquet file format version and changing
> >>     > ByteArray len to uint64_t.
> >>     >
> >>     > Have there been any requests for this or other Parquet
> >>     > developments that require file format versioning changes?
> >>     >
> >>     > I realize this is a non-trivial ask.  Thanks for considering it.
> >>     >
> >>     > -Brian
> >>     >
> >>
> >>
> >>     --
> >>     Ryan Blue
> >>     Software Engineer
> >>     Netflix
> >>
> >>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix
