[ https://issues.apache.org/jira/browse/DRILL-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392076#comment-14392076 ]
Jack Crawford commented on DRILL-2616:
--------------------------------------
When I query through Drill, certain strings from some rows are repeated far
more often than they appear in the original data. An example query for the
first 5 rows shows this in the 'indicator' column. Further through the
select * results, the id column shows it as well: Drill comes back with ~3 or
so unique ids, but the actual data source has many more.
query:
select * from dfs.`indicators.parquet` limit 5;
+------------------------------------------------+---------------------+-----------+------------------------+
| id                                             | timeNanos           | indicator | value                  |
+------------------------------------------------+---------------------+-----------+------------------------+
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear  | -0.0                   |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear  | -4.0612379933691045E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear  | -0.0                   |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear  | -2.6080420511220836E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555461205550000 | distNear  | -0.0                   |
+------------------------------------------------+---------------------+-----------+------------------------+
expected output (verified by loading in spark):
id                                             timeNanos           indicator value
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555457827764000 distNear  -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555457827764000 smartDiff -0.000406
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555458137319000 distNear  -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555458137319000 smartDiff -0.000261
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555461205550000 distNear  -0.000000
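
The mismatch between the two listings above can be checked mechanically. The following is a minimal Python sketch, not part of Drill or Spark: the rows are copied by hand from the two outputs, and the helper name `diff_string_columns` is hypothetical. It compares only the string columns (id and indicator), since those are the ones the report concerns.

```python
# Rows copied from the Drill and Spark listings: (id, timeNanos, indicator).
COLUMNS = ("id", "timeNanos", "indicator")

DRILL_ID = "generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7"
SPARK_ID = "generated-4458776b-4e22-415e-8fd9-29b687f40dce"

DRILL_ROWS = [
    (DRILL_ID, 1427555457827764000, "distNear"),
    (DRILL_ID, 1427555457827764000, "distNear"),
    (DRILL_ID, 1427555458137319000, "distNear"),
    (DRILL_ID, 1427555458137319000, "distNear"),
    (DRILL_ID, 1427555461205550000, "distNear"),
]

SPARK_ROWS = [
    (SPARK_ID, 1427555457827764000, "distNear"),
    (SPARK_ID, 1427555457827764000, "smartDiff"),
    (SPARK_ID, 1427555458137319000, "distNear"),
    (SPARK_ID, 1427555458137319000, "smartDiff"),
    (SPARK_ID, 1427555461205550000, "distNear"),
]


def diff_string_columns(actual, expected, columns=("id", "indicator")):
    """Hypothetical helper: list (row, column, actual, expected) mismatches."""
    mismatches = []
    for row, (a, e) in enumerate(zip(actual, expected)):
        for col in columns:
            idx = COLUMNS.index(col)
            if a[idx] != e[idx]:
                mismatches.append((row, col, a[idx], e[idx]))
    return mismatches


mismatches = diff_string_columns(DRILL_ROWS, SPARK_ROWS)
for row, col, got, want in mismatches:
    print(f"row {row}, column {col}: Drill returned {got!r}, Spark returned {want!r}")
```

On these five rows the id differs in every row and the indicator differs in the second and fourth rows, while timeNanos agrees throughout.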
> strings loaded incorrectly from parquet files
> ---------------------------------------------
>
> Key: DRILL-2616
> URL: https://issues.apache.org/jira/browse/DRILL-2616
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jack Crawford
> Assignee: Jason Altekruse
> Priority: Critical
> Labels: parquet
>
> When loading string columns from parquet data sources, some rows have their
> string values replaced with the value from other rows.
> Example parquet for which the problem occurs:
> https://drive.google.com/file/d/0B2JGBdceNMxdeFlJcW1FUElOdXc/view?usp=sharing