[ https://issues.apache.org/jira/browse/AVRO-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591135#comment-13591135 ]
Doug Cutting commented on AVRO-1259:
------------------------------------
For a union like ["null", "int"], Trevni writes two columns, one for null
values and one for the ints. Each is written as a sequence of <length><value>
pairs, where the length is always either zero or one. With this optimization,
if most values are null, then the int column will be compactly written as
<rowsToSkip><intValue><rowsToSkip><intValue>..., but the null column will still
be written as <isNull><isNull>... (Since the nulls themselves have no size,
the column just records which rows have a null value for the union and which do
not.) This could be improved if we also run-length encoded ones. We might,
e.g., use even negative values for zero runs and odd negative values for one
runs. Then the null column for a sparse field would look like
<nullCount><nullCount>... Similarly, if a ["null", "int"] nearly always has a
non-null value then the null column would look like <rowsToSkip><rowsToSkip>...
while the int column would have <valueCount><int><int>... In general,
optimizing runs of both zeros and ones would greatly improve the
representation of unions.
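
A rough sketch of the even/odd idea, in Python for brevity rather than
Trevni's Java. The exact mapping is only an assumption for illustration and
is not taken from the patch: a run of k zeros is written as -2k, a run of k
ones as -(2k - 1), and any other length is written literally. The names
encode_runs and decode_runs are hypothetical.

```python
from itertools import groupby


def encode_runs(lengths):
    """Run-length encode a sequence of non-negative lengths.

    Assumed mapping (illustrative only):
      run of k zeros -> -2k        (even negative)
      run of k ones  -> -(2k - 1)  (odd negative)
      other lengths  -> written literally
    """
    out = []
    for value, group in groupby(lengths):
        count = sum(1 for _ in group)
        if value == 0:
            out.append(-2 * count)
        elif value == 1:
            out.append(-(2 * count - 1))
        else:
            out.extend([value] * count)
    return out


def decode_runs(encoded):
    """Invert encode_runs; literal 0 and 1 entries are still accepted."""
    out = []
    for v in encoded:
        if v >= 0:
            out.append(v)                       # literal length
        elif v % 2 == 0:
            out.extend([0] * (-v // 2))         # even negative: zero run
        else:
            out.extend([1] * ((-v + 1) // 2))   # odd negative: one run
    return out
```

Under this mapping the lengths [1, 1, 1, 0, 1, 1] encode to [-5, -2, -3],
and a stream containing only literal zeros and ones decodes unchanged.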
> improve Trevni encoding of sparse fields
> ----------------------------------------
>
> Key: AVRO-1259
> URL: https://issues.apache.org/jira/browse/AVRO-1259
> Project: Avro
> Issue Type: Improvement
> Components: trevni
> Reporter: Doug Cutting
> Attachments: AVRO-1259.patch
>
>
> If in most records a field is null, Trevni writes a null byte (zero length)
> for that record in that column. This might be optimized by instead using a
> run-length encoding for lengths. The length is signed, so negative lengths
> might be used to indicate the number of zero-lengths before the next non-zero
> value. This could thus be back-compatible.
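
A minimal sketch of the scheme in the description above, in Python for
brevity. It assumes that a negative length -k simply stands for k consecutive
zero lengths; the function names are hypothetical and this is not the code in
the attached patch.

```python
def encode_lengths(lengths):
    """Collapse each run of zero lengths into a single negative count."""
    out = []
    zeros = 0
    for n in lengths:
        if n < 0:
            raise ValueError("lengths must be non-negative")
        if n == 0:
            zeros += 1
            continue
        if zeros:
            out.append(-zeros)   # k zero lengths before this value
            zeros = 0
        out.append(n)
    if zeros:
        out.append(-zeros)       # trailing run of zero lengths
    return out


def decode_lengths(encoded):
    """Expand negative counts back into runs of zero lengths."""
    out = []
    for n in encoded:
        if n < 0:
            out.extend([0] * -n)
        else:
            out.append(n)        # includes literal zeros from old writers
    return out
```

A sequence written the old way, with literal zeros, contains no negative
lengths and so decodes to itself, which is the sense in which the change can
be back-compatible for readers.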