Steve Yang created HADOOP-13619:
-----------------------------------

             Summary: missing data when reading avro file
                 Key: HADOOP-13619
                 URL: https://issues.apache.org/jira/browse/HADOOP-13619
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/swift
    Affects Versions: 2.6.0
         Environment: Linux EL6
            Reporter: Steve Yang


library used: org.apache.hadoop:hadoop-openstack:2.6.0
We are loading avro files from an Oracle Storage Service server (i.e., a Swift 
server) into a Spark DataFrame object through the Spark Data Source API. For 
example:
return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);

The number of records read is less than the actual record count in the avro file 
when reading the avro file from the Storage Service server using the OpenStack 
Swift API.

If we run a SQL query on top of the returned data frame, such as "select count(*) as C1 
from <temp table>", we can see that the record count is smaller than when reading the 
same avro file from the local file system.

For a large avro file (awclassic.avro, 105M) the count is always wrong (42451 
records read vs. 60855 actual). From the log file we can see that the read of the 
file is split into 4:
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432

For a smaller avro file (wine.avro, 19M) the count is sometimes correct (57076 
records) and sometimes wrong (26999 records). Running the same Spark SQL query 10 
times back-to-back produces the following record counts:
run 1: 26999
run 2: 26999
run 3: 57076
run 4: 57056
run 5: 57076
run 6: 26999
run 7: 57076
run 8: 57076
run 9: 57076
run 10: 57076

For this wine.avro test case there are two splits:
2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: 
swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: 
swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
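
As a sanity check on the split lines logged above, the following minimal sketch parses the "offset+length" suffix of each split and verifies that the ranges are contiguous from offset 0. The class and method names (SplitCoverage, parseSplit, contiguousBytes) are hypothetical helpers written for this illustration; they are not part of Hadoop, Spark, or spark-avro.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or Spark.
public class SplitCoverage {

    /** Parses the "offset+length" suffix that follows the last ':' of a logged split. */
    static long[] parseSplit(String split) {
        String range = split.substring(split.lastIndexOf(':') + 1);
        int plus = range.indexOf('+');
        long offset = Long.parseLong(range.substring(0, plus));
        long length = Long.parseLong(range.substring(plus + 1));
        return new long[] {offset, length};
    }

    /**
     * Returns the total number of bytes covered when the splits are contiguous
     * starting at offset 0, or -1 if there is a gap or an overlap.
     */
    static long contiguousBytes(List<String> splits) {
        List<long[]> ranges = new ArrayList<>();
        for (String s : splits) {
            ranges.add(parseSplit(s));
        }
        // The log lists splits in arbitrary order, so sort by start offset first.
        ranges.sort(Comparator.comparingLong(r -> r[0]));
        long expectedOffset = 0;
        for (long[] r : ranges) {
            if (r[0] != expectedOffset) {
                return -1;
            }
            expectedOffset += r[1];
        }
        return expectedOffset;
    }

    public static void main(String[] args) {
        // Split strings copied from the log lines in this report.
        List<String> awclassic = Arrays.asList(
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432");
        List<String> wine = Arrays.asList(
            "swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270",
            "swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269");

        System.out.println("awclassic.avro: " + contiguousBytes(awclassic)); // awclassic.avro: 110708043
        System.out.println("wine.avro: " + contiguousBytes(wine));           // wine.avro: 19930539
    }
}
```

Both sets of splits are contiguous, and the totals (110708043 and 19930539 bytes) are consistent with the reported file sizes of 105M and 19M. This only shows that the logged split ranges leave no byte range unassigned; it says nothing about how many bytes were actually read within each split.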

I will attach a zip file containing the two avro files in question and the 
debug log sections for reading the wine.avro file: one with a successful 
read (C4.ok) and one with missing records (C5.miss).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
