Ryan do you happen to have any opinion there? that particular section
was introduced in the Parquet 1.10 update:
https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20
It looks like it didn't used to make a ByteBuffer each time, but read from `in` directly.
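For what it's worth, a minimal, self-contained sketch of the "slice once" idea from the thread below. One caveat on the offset math: the proposal as quoted passes buffer.position() + valueIndex, but each group of 8 values occupies exactly bitWidth bytes of input, so the input offset should advance by valueIndex / 8 * bitWidth bytes. The unpack8Values below is a hand-rolled stand-in hard-coded to bitWidth = 1 (it is NOT Parquet's real BytePacker API), just to make the byte-offset arithmetic concrete:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SliceOnceSketch {
    // Simplified stand-in for a bit unpacker, fixed at bitWidth = 1:
    // each input byte holds 8 one-bit values, least-significant bit first.
    static void unpack8Values(ByteBuffer in, int inPos, int[] out, int outPos) {
        int b = in.get(inPos) & 0xFF;       // absolute get; does not move position
        for (int i = 0; i < 8; i++) {
            out[outPos + i] = (b >>> i) & 1;
        }
    }

    // Unpack a whole PACKED run from a buffer obtained once, outside the loop.
    static int[] unpackRun(byte[] encoded, int bitWidth, int currentCount) {
        int[] currentBuffer = new int[currentCount];
        ByteBuffer buffer = ByteBuffer.wrap(encoded);  // one wrap, not one slice per group
        int valueIndex = 0;
        while (valueIndex < currentCount) {
            // Each group of 8 values occupies bitWidth bytes of input,
            // so the input offset advances in bytes, not in values.
            int byteOffset = (valueIndex / 8) * bitWidth;
            unpack8Values(buffer, buffer.position() + byteOffset, currentBuffer, valueIndex);
            valueIndex += 8;
        }
        return currentBuffer;
    }

    public static void main(String[] args) {
        // 0xA5 = 10100101, 0x0F = 00001111; values come out LSB first
        System.out.println(Arrays.toString(
            unpackRun(new byte[]{(byte) 0xA5, (byte) 0x0F}, 1, 16)));
    }
}
```

The point of the sketch is only the loop shape: a single buffer obtained up front, with a per-group byte offset, instead of a fresh slice() allocation on every iteration.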

On Sun, Sep 13, 2020 at 10:48 PM Chang Chen <baibaic...@gmail.com> wrote:
>
> I think we can copy all the encoded data into a ByteBuffer once and unpack the values in the loop:
>
>  while (valueIndex < this.currentCount) {
>     // values are bit packed 8 at a time, so reading bitWidth will always work
>     this.packer.unpack8Values(buffer, buffer.position() + valueIndex, this.currentBuffer, valueIndex);
>     valueIndex += 8;
>   }
>
> Sean Owen <sro...@gmail.com> 于2020年9月14日周一 上午10:40写道:
>>
>> It certainly can't be called once - it's reading different data each time.
>> There might be a faster way to do it, I don't know. Do you have ideas?
>>
>> On Sun, Sep 13, 2020 at 9:25 PM Chang Chen <baibaic...@gmail.com> wrote:
>> >
>> > Hi experts,
>> >
>> > It looks like there is a hot spot in VectorizedRleValuesReader#readNextGroup():
>> >
>> > case PACKED:
>> >   int numGroups = header >>> 1;
>> >   this.currentCount = numGroups * 8;
>> >
>> >   if (this.currentBuffer.length < this.currentCount) {
>> >     this.currentBuffer = new int[this.currentCount];
>> >   }
>> >   currentBufferIdx = 0;
>> >   int valueIndex = 0;
>> >   while (valueIndex < this.currentCount) {
>> >     // values are bit packed 8 at a time, so reading bitWidth will always work
>> >     ByteBuffer buffer = in.slice(bitWidth);
>> >     this.packer.unpack8Values(buffer, buffer.position(), this.currentBuffer, valueIndex);
>> >     valueIndex += 8;
>> >   }
>> >
>> >
>> > Per my profile, the code spends about 30% of readNextGroup()'s time in slice(). Why can't we call slice outside of the loop?

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
