[ 
https://issues.apache.org/jira/browse/DRILL-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392076#comment-14392076
 ] 

Jack Crawford commented on DRILL-2616:
--------------------------------------

When I query through Drill, certain string values are repeated far more often 
than they appear in the original data. The example query below, for the first 
5 rows, shows this in the 'indicator' column. Further through the select * 
results, the id column shows it as well: Drill comes back with ~3 or so unique 
ids, but the actual data source has many more.

query:
select * from dfs.`indicators.parquet` limit 5;

+------------+------------+------------+------------+
|     id     | timeNanos  | indicator  |   value    |
+------------+------------+------------+------------+
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear   | -0.0       |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear   | -4.0612379933691045E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear   | -0.0       |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear   | -2.6080420511220836E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555461205550000 | distNear   | -0.0       |
+------------+------------+------------+------------+

expected output (verified by loading in spark):
                                            id            timeNanos  indicator      value
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555457827764000   distNear  -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555457827764000  smartDiff  -0.000406
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555458137319000   distNear  -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555458137319000  smartDiff  -0.000261
generated-4458776b-4e22-415e-8fd9-29b687f40dce  1427555461205550000   distNear  -0.000000
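
To make the discrepancy between the two outputs above easier to see, here is a 
minimal Python sketch that compares them column by column. The rows are copied 
by hand from the Drill and Spark outputs in this comment (the value column is 
left out because Spark printed it rounded); the helper name mismatched_rows is 
made up for illustration and is not part of Drill or Spark.

```python
# Rows copied by hand from the comment above: (id, timeNanos, indicator).
DRILL_ID = "generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7"
SPARK_ID = "generated-4458776b-4e22-415e-8fd9-29b687f40dce"
TIMES = [
    1427555457827764000,
    1427555457827764000,
    1427555458137319000,
    1427555458137319000,
    1427555461205550000,
]

# Drill returned the same id and the same indicator for all five rows.
drill_rows = [(DRILL_ID, t, "distNear") for t in TIMES]

# Spark returned a different id and alternating indicators.
spark_indicators = ["distNear", "smartDiff", "distNear", "smartDiff", "distNear"]
spark_rows = [(SPARK_ID, t, ind) for t, ind in zip(TIMES, spark_indicators)]

COLUMNS = ("id", "timeNanos", "indicator")


def mismatched_rows(actual, expected):
    """Map each column name to the row indexes where the two results differ.

    Hypothetical helper for this illustration only.
    """
    diffs = {name: [] for name in COLUMNS}
    for i, (a, e) in enumerate(zip(actual, expected)):
        for name, a_val, e_val in zip(COLUMNS, a, e):
            if a_val != e_val:
                diffs[name].append(i)
    return diffs


diffs = mismatched_rows(drill_rows, spark_rows)
for name in COLUMNS:
    print(name, diffs[name])
```

On these five rows only the string columns (id and indicator) differ, while 
timeNanos matches in every row.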

> strings loaded incorrectly from parquet files
> ---------------------------------------------
>
>                 Key: DRILL-2616
>                 URL: https://issues.apache.org/jira/browse/DRILL-2616
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jack Crawford
>            Assignee: Jason Altekruse
>            Priority: Critical
>              Labels: parquet
>
> When loading string columns from parquet data sources, some rows have their 
> string values replaced with the value from other rows.
> Example parquet for which the problem occurs:
> https://drive.google.com/file/d/0B2JGBdceNMxdeFlJcW1FUElOdXc/view?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
