[jira] [Updated] (HADOOP-13619) missing data intermittently when reading avro file in Spark from Swift storage

Yulei Li (JIRA) Wed, 21 Sep 2016 23:26:31 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yulei Li updated HADOOP-13619:
------------------------------
    Assignee:     (was: Yulei Li)

> missing data intermittently when reading avro file in Spark from Swift storage
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-13619
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13619
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/swift
>    Affects Versions: 2.6.0
>         Environment: Linux EL6
>            Reporter: Steve Yang
>            Priority: Blocker
>
> library used: org.apache.hadoop:hadoop-openstack:2.6.0
> We are loading avro files from Oracle Storage Service server (i.e., Swift 
> server) into Spark DataFrame object through the Spark Data Source API. For 
> example:
> return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);
> The number of records is less than the actual record count in the avro file 
> when reading the avro file from Storage Service server using OpenStack Swift 
> API.
> If we run a SQL on top of the returned data frome like "select count(\*) as 
> C1 from <temp table>" we can see the record count is smaller when reading the 
> same avro file from local file system.
> For a large avro file (awclassic.avro, 105M) the count is always wrong (42451 
> records vs. 60855). From the log file we can see the reading os the file is 
> splitted into 4:
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432
> For a smaller avro file (wine.avro, 19M) the count sometimes is correct 
> (57076 records) and sometimes wrong (26999 records). Run the same spark SQL 
> 10 times back-to-back produces the following record count results:
> run 1: 26999
> run 2: 26999
> run 3: 57076
> run 4: 57056
> run 5: 57076
> run 6: 26999
> run 7: 57076
> run 8: 57076
> run 9: 57076
> run 10: 57076
> For this wine.avro test case there are two splits:
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
> I will attach a zip file containing the smaller avro file in question and the 
> debugged log file section of reading wine.avro file - one with successful 
> reading(C4.ok) and one with missing record reading(C5.miss).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HADOOP-13619) missing data intermittently when reading avro file in Spark from Swift storage

Reply via email to