[jira] [Commented] (HADOOP-13619) missing data intermittently when reading avro file in Spark from Swift storage

Steve Loughran (JIRA) Mon, 19 Sep 2016 03:24:14 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503018#comment-15503018
 ]


Steve Loughran commented on HADOOP-13619:
-----------------------------------------


You aren't overwriting blobs with new ones of different sizes? That can cause 
visible consistency problems.

Another possible cause is that these are clearly multipart files, which are 
trouble in their own ways. Specifically, I've seen mismatches between the size 
returned by FileSystem.getFileStatus() and the actual file length as returned 
in reads.

Can you replicate this locally? That is, repeatedly run getFileStatus() against 
the file and verify that it always returns the length of the file just written.
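
A minimal sketch of that repeated check, in plain Java so it runs without a 
cluster. The class name LengthProbe and the LongSupplier wrapper are 
illustrative assumptions, not part of Hadoop; against a real store the supplier 
would wrap something like fs.getFileStatus(path).getLen() and handle the 
IOException it can throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Polls a reported file length repeatedly and records every disagreement. */
final class LengthProbe {

    /**
     * Calls reportedLength `attempts` times and returns each value that
     * differs from expectedLength, in the order observed.
     */
    static List<Long> mismatches(LongSupplier reportedLength,
                                 long expectedLength,
                                 int attempts) {
        List<Long> bad = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            long seen = reportedLength.getAsLong();
            if (seen != expectedLength) {
                bad.add(seen);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real status call; it always reports
        // the expected length, so the printed list of mismatches is empty.
        long expected = 19930539L; // sum of the two wine.avro splits in the issue
        LongSupplier reported = () -> expected;
        System.out.println(mismatches(reported, expected, 10));
    }
}
```

A non-empty result would point at the status call; an empty one would suggest 
looking at the reads themselves, for example by counting the bytes actually 
returned from the stream and comparing that total to the reported length.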

Now the bad news: you are probably going to have to be the one to identify and, 
if possible, fix the problem. I would also recommend you work on the branch-2 
branch, because that and trunk are where fixes will go, maybe backported to 
2.8 or 2.7.4.


I've been writing some Spark & object store tests in SPARK-7481, albeit with a 
focus on S3; in HADOOP-11694 we've been adding scale tests for S3 too.

One limitation we have there is the lack of large public datasets, and the 
time/cost it takes to set up transient ones for a single test. Would you be 
able to serve up the specific file causing problems here as a public object? 
That way some of our integration tests could work with it directly, the way we 
do for some AWS tests today.


> missing data intermittently when reading avro file in Spark from Swift storage
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-13619
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13619
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/swift
>    Affects Versions: 2.6.0
>         Environment: Linux EL6
>            Reporter: Steve Yang
>            Priority: Blocker
>
> library used: org.apache.hadoop:hadoop-openstack:2.6.0
> We are loading avro files from Oracle Storage Service server (i.e., Swift 
> server) into Spark DataFrame object through the Spark Data Source API. For 
> example:
> return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);
> The number of records is less than the actual record count in the avro file 
> when reading the avro file from Storage Service server using OpenStack Swift 
> API.
> If we run SQL on top of the returned data frame, like "select count(*) as 
> C1 from <temp table>", we can see the record count is smaller than when 
> reading the same avro file from the local file system.
> For a large avro file (awclassic.avro, 105M) the count is always wrong (42451 
> records vs. 60855). From the log file we can see the read of the file is 
> split into 4:
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
> 2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432
> For a smaller avro file (wine.avro, 19M) the count is sometimes correct 
> (57076 records) and sometimes wrong (26999 records). Running the same Spark 
> SQL 10 times back-to-back produces the following record count results:
> run 1: 26999
> run 2: 26999
> run 3: 57076
> run 4: 57056
> run 5: 57076
> run 6: 26999
> run 7: 57076
> run 8: 57076
> run 9: 57076
> run 10: 57076
> For this wine.avro test case there are two splits:
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
> 2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: 
> swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
> I will attach a zip file containing the smaller avro file in question and the 
> debug log sections for reading wine.avro: one with a successful read (C4.ok) 
> and one with missing records (C5.miss).
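
The split offsets quoted in the description can be checked for gaps or 
overlaps with a short piece of arithmetic. The sketch below is illustrative 
only: the class name SplitCoverage and its method are assumptions made for 
this example, and it inspects the logged "offset+length" pairs, not what the 
reads actually returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Checks that "path:offset+length" splits tile a file with no gaps or overlaps. */
final class SplitCoverage {

    // Matches the trailing ":<offset>+<length>" of a logged input split.
    private static final Pattern SPLIT = Pattern.compile(":(\\d+)\\+(\\d+)$");

    /**
     * Returns the total bytes covered when the splits are contiguous from
     * offset 0, or -1 on a gap, an overlap, or an unparseable entry.
     */
    static long coveredBytes(List<String> splits) {
        List<long[]> ranges = new ArrayList<>();
        for (String s : splits) {
            Matcher m = SPLIT.matcher(s.trim());
            if (!m.find()) {
                return -1;
            }
            ranges.add(new long[] {Long.parseLong(m.group(1)),
                                   Long.parseLong(m.group(2))});
        }
        // Log order is arbitrary, so sort by starting offset first.
        ranges.sort(Comparator.comparingLong(r -> r[0]));
        long next = 0;
        for (long[] r : ranges) {
            if (r[0] != next) {
                return -1;
            }
            next += r[1];
        }
        return next;
    }

    public static void main(String[] args) {
        String base = "swift://qaTestData.oracleswift/testAvro/";
        System.out.println(coveredBytes(Arrays.asList(
                base + "awclassic.avro:100663296+10044747",
                base + "awclassic.avro:33554432+33554432",
                base + "awclassic.avro:0+33554432",
                base + "awclassic.avro:67108864+33554432")));
        System.out.println(coveredBytes(Arrays.asList(
                base + "wine.avro:9965269+9965270",
                base + "wine.avro:0+9965269")));
    }
}
```

Both sets of splits are contiguous from offset 0, covering 110708043 bytes for 
awclassic.avro and 19930539 bytes for wine.avro, which agrees with the quoted 
105M and 19M sizes. That would suggest the split planning saw a consistent 
file length, so the missing records may stem from the reads rather than from 
the split boundaries, though the logs alone cannot confirm this.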



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)