[ https://issues.apache.org/jira/browse/DRILL-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010116#comment-15010116 ]

ASF GitHub Bot commented on DRILL-4056:
---------------------------------------

GitHub user jaltekruse opened a pull request:

    https://github.com/apache/drill/pull/266

    DRILL-4056: Avro corruption bug with UTF-8 strings

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaltekruse/incubator-drill 4056-avro-corruption-bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #266
    
----
commit a3e0cbe3820a0350d58c59f374877a12184850e0
Author: Jason Altekruse <[email protected]>
Date:   2015-11-13T23:46:58Z

    DRILL-4056: Fix corruption bug reading string data out of Avro

commit 44460fd5a72d6a61b232c335bb8beaaff9daad87
Author: Jason Altekruse <[email protected]>
Date:   2015-11-14T00:26:33Z

    DRILL-4056: Part 2 - Cleanup in Avro reader.
    
    Removed use of unnecessary Holder objects. Added restriction on batch
    size produced by a single call to next. Did not get a chance to confirm
    but it looks like it was reading an entire file into a single batch,
    which could have serious performance impacts on very large files.

commit dc084c1255a59aead865e641f952e9e162d4c5e5
Author: Jason Altekruse <[email protected]>
Date:   2015-11-17T23:42:44Z

    DRILL-4056: Part 3 - Adding results verification to avro tests.
    
    Task to be finished as part of DRILL-4110.

----


> Avro deserialization corrupts data
> ----------------------------------
>
>                 Key: DRILL-4056
>                 URL: https://issues.apache.org/jira/browse/DRILL-4056
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Other
>    Affects Versions: 1.3.0
>         Environment: Ubuntu 15.04 - Oracle Java
>            Reporter: Stefán Baxter
>            Assignee: Jason Altekruse
>             Fix For: 1.3.0
>
>         Attachments: test.zip
>
>
> I have an Avro file that supports the following data/schema:
> {"field":"some", "classification":{"variant":"Gæst"}}
> When I select 10 rows from this file I get:
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | Gæst                |
> | Voksen              |
> | Voksen              |
> | Invitation KIF KBH  |
> | Invitation KIF KBH  |
> | Ordinarie pris KBH  |
> | Ordinarie pris KBH  |
> | Biljetter 200 krBH  |
> | Biljetter 200 krBH  |
> | Biljetter 200 krBH  |
> +---------------------+
> The bug is that field values are deserialized incorrectly: when a row's
> value is shorter than the previous row's, the trailing characters of the
> previous value are retained.
> The SQL query:
> "select s.classification.variant variant from dfs.<some> as s limit 10;"
> For example, "Ordinarie pris" becomes "Ordinarie pris KBH" because the
> previous row had the value "Invitation KIF KBH".
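
The symptom described above (a shorter value keeping the tail of the previous, longer one) is what typically happens when a reader reuses one byte buffer across rows and then decodes the whole buffer instead of only the valid length. The following is a minimal sketch of that mechanism and is not taken from the Drill patch. The class ReusableString and the methods readBuggy and readFixed are hypothetical names used only for illustration; Avro's own Utf8 class behaves similarly in that getBytes() may return a backing array longer than getByteLength().

```java
import java.nio.charset.StandardCharsets;

public class ReusedBufferDemo {

    // Hypothetical stand-in for a string holder that reuses one backing
    // array across rows (it grows when needed but never shrinks).
    static final class ReusableString {
        private byte[] bytes = new byte[0];
        private int length;

        void set(String value) {
            byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
            if (encoded.length > bytes.length) {
                bytes = new byte[encoded.length];
            }
            // Overwrites only the prefix; older bytes past it stay in place.
            System.arraycopy(encoded, 0, bytes, 0, encoded.length);
            length = encoded.length;
        }

        byte[] backingArray() { return bytes; }
        int byteLength() { return length; }
    }

    // Buggy read: decodes the entire backing array, stale tail included.
    static String readBuggy(ReusableString s) {
        return new String(s.backingArray(), StandardCharsets.UTF_8);
    }

    // Correct read: decodes only the first byteLength() bytes.
    static String readFixed(ReusableString s) {
        return new String(s.backingArray(), 0, s.byteLength(),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableString row = new ReusableString();
        row.set("Invitation KIF KBH");
        row.set("Ordinarie pris");
        System.out.println(readBuggy(row)); // prints "Ordinarie pris KBH"
        System.out.println(readFixed(row)); // prints "Ordinarie pris"
    }
}
```

Under these assumptions the buggy read reproduces the "Ordinarie pris KBH" value from the report, and bounding the decode by the valid byte length removes the stale tail. Counting bytes rather than characters matters for values such as "Gæst", where "æ" occupies two bytes in UTF-8.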



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
