[ https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Yang updated HADOOP-13619:
--------------------------------
    Description:
library used: org.apache.hadoop:hadoop-openstack:2.6.0

We are loading Avro files from an Oracle Storage Service server (i.e., a Swift server) into a Spark DataFrame object through the Spark Data Source API. For example:

    return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);

The number of records returned is less than the actual record count in the Avro file when the file is read from the Storage Service server using the OpenStack Swift API. If we run SQL on top of the returned data frame, such as "select count(*) as C1 from <temp table>", the record count is smaller than when reading the same Avro file from the local file system.

For a large Avro file (awclassic.avro, 105M) the count is always wrong (42451 records vs. 60855). From the log file we can see that the read of the file is split into 4:

2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432

For a smaller Avro file (wine.avro, 19M) the count is sometimes correct (57076 records) and sometimes wrong (26999 records).
Running the same Spark SQL query 10 times back-to-back produces the following record counts:

run 1: 26999
run 2: 26999
run 3: 57076
run 4: 57056
run 5: 57076
run 6: 26999
run 7: 57076
run 8: 57076
run 9: 57076
run 10: 57076

For this wine.avro test case there are two splits:

2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269

I will attach a zip file containing the smaller Avro file in question and the debug log file sections for reading wine.avro: one from a successful read (C4.ok) and one from a read with missing records (C5.miss).

  was:
library used: org.apache.hadoop:hadoop-openstack:2.6.0

We are loading Avro files from an Oracle Storage Service server (i.e., a Swift server) into a Spark DataFrame object through the Spark Data Source API. For example:

    return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);

The number of records returned is less than the actual record count in the Avro file when the file is read from the Storage Service server using the OpenStack Swift API. If we run SQL on top of the returned data frame, such as "select count(*) as C1 from <temp table>", the record count is smaller than when reading the same Avro file from the local file system.

For a large Avro file (awclassic.avro, 105M) the count is always wrong (42451 records vs. 60855).
From the log file we can see that the read of the file is split into 4:

2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432

For a smaller Avro file (wine.avro, 19M) the count is sometimes correct (57076 records) and sometimes wrong (26999 records). Running the same Spark SQL query 10 times back-to-back produces the following record counts:

run 1: 26999
run 2: 26999
run 3: 57076
run 4: 57056
run 5: 57076
run 6: 26999
run 7: 57076
run 8: 57076
run 9: 57076
run 10: 57076

For this wine.avro test case there are two splits:

2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269

I will attach a zip file containing the two Avro files in question and the debug log file sections for reading wine.avro: one from a successful read (C4.ok) and one from a read with missing records (C5.miss).


> missing data intermittently when reading avro file
> --------------------------------------------------
>
>                 Key: HADOOP-13619
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13619
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/swift
>    Affects Versions: 2.6.0
>         Environment: Linux EL6
>            Reporter: Steve Yang
>            Priority: Blocker
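
The input splits quoted in the report are logged as path:offset+length. As a sanity check on those numbers, the following minimal sketch parses the logged form and verifies that the splits tile a file from byte 0 with no gap or overlap. The class and method names (SplitCoverage, parse, contiguousBytes) are hypothetical and are not part of Hadoop or Spark; the sketch only does arithmetic on the logged strings and makes no calls to the Swift client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Spark: checks logged split ranges.
public class SplitCoverage {

    /** One input split: the byte range [offset, offset + length) of a file. */
    static final class Split {
        final long offset;
        final long length;

        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Parses the "path:offset+length" form shown in the HadoopRDD log lines. */
    static Split parse(String logged) {
        int colon = logged.lastIndexOf(':');
        int plus = logged.lastIndexOf('+');
        long offset = Long.parseLong(logged.substring(colon + 1, plus));
        long length = Long.parseLong(logged.substring(plus + 1));
        return new Split(offset, length);
    }

    /**
     * Returns the total number of bytes covered if the splits tile the file
     * from byte 0 with no gap or overlap; returns -1 otherwise.
     */
    static long contiguousBytes(List<String> loggedSplits) {
        List<Split> splits = new ArrayList<>();
        for (String logged : loggedSplits) {
            splits.add(parse(logged));
        }
        // The log lists splits in arbitrary order, so sort by start offset.
        splits.sort(Comparator.comparingLong(sp -> sp.offset));
        long expectedOffset = 0;
        for (Split split : splits) {
            if (split.offset != expectedOffset) {
                return -1; // gap or overlap between neighbouring splits
            }
            expectedOffset += split.length;
        }
        return expectedOffset;
    }

    public static void main(String[] args) {
        List<String> awclassic = Arrays.asList(
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432",
            "swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432");
        List<String> wine = Arrays.asList(
            "swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270",
            "swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269");

        System.out.println("awclassic.avro bytes covered: " + contiguousBytes(awclassic)); // 110708043
        System.out.println("wine.avro bytes covered: " + contiguousBytes(wine));           // 19930539
    }
}
```

Both sets of logged splits are contiguous and add up to sizes consistent with the reported 105M and 19M files. This only shows that the split boundaries as logged leave no gap; it says nothing about which bytes each reader actually received from the Swift server.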