[
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062175#comment-15062175
]
Tim Preece commented on SPARK-12319:
------------------------------------
Hi Sean, Yin,
I've started (and am continuing) to investigate the DatasetAggregatorSuite
failure described above.
So far I believe:
a) the description is incorrect: the failure has nothing to do with endianness
or BitSetMethods.java. (It just happens that we see the failure on big-endian
platforms; see below.)
b) the problem is probably in the codegen for UnsafeRow joins
(GenerateUnsafeRowJoiner).
I see two UnsafeRows being joined: (string,int) + (string), which results in
an UnsafeRow with schema (string,int,string).
When we come to update the offsets for the variable-length data (in this case
for the first string), in updateOffset in GenerateUnsafeRowJoiner, the offset
is miscalculated.
This means the int value in the second field slot is wrongly changed, and on a
BE platform (for this particular test case) it is incremented by 8.
On an LE platform the contents of the second field slot are also changed, but
in a way that does not alter the value of the int. However, on both BE and LE
platforms the first string field looks bogus, with an invalid variable-length
offset.
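To illustrate the BE/LE difference, here is a minimal Java sketch. It is not
Spark code. It assumes that each field occupies an 8-byte slot, that an int is
stored in the first 4 bytes of its slot, and that an offset update adds
(delta << 32) to the slot read as a long. The class and method names are
hypothetical, and ByteBuffer byte orders stand in for the two platforms.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not taken from the Spark source.
class SlotEndianSketch {

    // Stores an int in the first 4 bytes of an 8-byte slot, then applies an
    // offset-style update (adds delta << 32 to the slot read as a long),
    // and returns the int that is read back from the slot afterwards.
    static int intAfterWrongUpdate(ByteOrder order, int value, long delta) {
        ByteBuffer slot = ByteBuffer.allocate(8).order(order);
        slot.putInt(0, value);                  // int lives in bytes 0..3
        long word = slot.getLong(0);            // whole slot viewed as a long
        slot.putLong(0, word + (delta << 32));  // update meant for an offset field
        return slot.getInt(0);
    }

    public static void main(String[] args) {
        // Big-endian: bytes 0..3 are the high half of the long, so the int changes.
        System.out.println(intAfterWrongUpdate(ByteOrder.BIG_ENDIAN, 1, 8L));    // prints 9
        // Little-endian: bytes 0..3 are the low half, so the int is untouched.
        System.out.println(intAfterWrongUpdate(ByteOrder.LITTLE_ENDIAN, 1, 8L)); // prints 1
    }
}
```

Under these assumptions, the big-endian case turns 1 into 9, which is
consistent with the "one, 9" result reported for the failing test, while the
little-endian case only disturbs the unused upper half of the slot.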
I'm continuing to investigate (and so could well revise the above), but
thought I would share my observations so far.
Also, it would be useful if you happen to have a pointer to any design
documentation for UnsafeRow. For example, I wasn't sure whether all the
variable-length data should go at the end of the row, that is, whether the
schema for the joined row should actually have been (int,string,string).
Tim Preece
> Address endian specific problems surfaced in 1.6
> ------------------------------------------------
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Environment: BE platforms
> Reporter: Adam Roberts
> Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed
> problems with DataFrames on BE platforms, e.g.
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we
> believe the issue lies within BitSetMethods.java, specifically around: return
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word);