[ 
https://issues.apache.org/jira/browse/AVRO-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591135#comment-13591135
 ] 

Doug Cutting commented on AVRO-1259:
------------------------------------

For a union like ["null", "int"], Trevni writes two columns, one for null 
values and one for the ints.  Each is written as a sequence of <length><value> 
pairs, where the length is always either zero or one.  With this optimization, 
if most values are null, then the int column will be compactly written as 
<rowsToSkip><intValue><rowsToSkip><intValue>..., but the null column will still 
be written as <isNull><isNull>...  (Since the nulls themselves have no size, 
the column just records which rows have a null value for the union and which do 
not.)  This could be improved if we also run-length encoded ones.  We might, 
e.g., use even negative values for zero runs and odd negative values for one 
runs.  Then the null column for a sparse value would look like 
<nullCount><nullCount>...  Similarly, if a ["null", "int"] nearly always has a 
non-null value then the null column would look like <rowsToSkip><rowsToSkip>... 
while the int column would have <valueCount><int><int>...  In general, 
optimizing zeros and ones both would greatly improve the representations of 
unions.
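
A minimal Python sketch of the even/odd idea, for illustration only.  The 
exact mapping is an assumption, not something Trevni defines: a run of n zeros 
is written as -2n, a run of n ones as -(2n - 1).  The names encode_runs and 
decode_runs are hypothetical.

```python
from itertools import groupby


def encode_runs(lengths):
    """Run-length encode a column whose lengths are all 0 or 1.

    Assumed mapping (hypothetical, not part of the Trevni format):
      run of n zeros -> -2n        (even negative)
      run of n ones  -> -(2n - 1)  (odd negative)
    """
    encoded = []
    for bit, group in groupby(lengths):
        if bit not in (0, 1):
            raise ValueError("lengths must be 0 or 1")
        n = sum(1 for _ in group)
        encoded.append(-2 * n if bit == 0 else -(2 * n - 1))
    return encoded


def decode_runs(encoded):
    """Invert encode_runs; non-negative entries are taken as literal lengths."""
    lengths = []
    for v in encoded:
        if v >= 0:
            # Literal 0 or 1, as an un-optimized writer would emit.
            lengths.append(v)
        elif v % 2 == 0:
            lengths.extend([0] * (-v // 2))
        else:
            lengths.extend([1] * ((-v + 1) // 2))
    return lengths


# Null column for the rows [null, null, null, 7, null, null]:
# length 1 marks a null row, length 0 a row holding an int.
null_column = [1, 1, 1, 0, 1, 1]
print(encode_runs(null_column))
```

Since the decoder treats non-negative entries as literal lengths, data written 
without the optimization would still be read correctly under this mapping.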
                
> improve Trevni encoding of sparse fields
> ----------------------------------------
>
>                 Key: AVRO-1259
>                 URL: https://issues.apache.org/jira/browse/AVRO-1259
>             Project: Avro
>          Issue Type: Improvement
>          Components: trevni
>            Reporter: Doug Cutting
>         Attachments: AVRO-1259.patch
>
>
> If in most records a field is null, Trevni writes a null byte (zero length) 
> for that record in that column.  This might be optimized by instead using a 
> run-length encoding for lengths.  The length is signed, so negative lengths 
> might be used to indicate the number of zero-lengths before the next non-zero 
> value.  This could thus be back-compatible.
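
The description above could be sketched roughly as follows in Python.  This is 
an illustration under assumptions, not the attached patch: the token layout, 
the handling of a trailing run of zero-length rows, and the names 
encode_lengths and decode_lengths are all hypothetical.

```python
def encode_lengths(values):
    """Encode a column of byte strings as <length><value> tokens.

    A run of k zero-length rows is collapsed into the single token -k.
    Emitting a trailing run the same way is an assumption of this sketch.
    """
    tokens = []
    zero_run = 0
    for v in values:
        if len(v) == 0:
            zero_run += 1
            continue
        if zero_run:
            tokens.append(-zero_run)
            zero_run = 0
        tokens.append(len(v))
        tokens.append(v)
    if zero_run:
        tokens.append(-zero_run)
    return tokens


def decode_lengths(tokens):
    """Decode tokens from encode_lengths or from a writer without the optimization."""
    values = []
    it = iter(tokens)
    for t in it:
        if t < 0:
            # Negative length: that many zero-length rows.
            values.extend([b""] * -t)
        elif t == 0:
            # Plain zero length, as older writers emit for each null row.
            values.append(b"")
        else:
            # Positive length: the value itself follows.
            values.append(next(it))
    return values


column = [b"", b"", b"", b"ab", b""]
print(encode_lengths(column))
```

Because a plain zero length is still accepted on read, files written before 
the change decode unchanged, which is the back-compatibility noted above.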
