[ https://issues.apache.org/jira/browse/HADOOP-18546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643760#comment-17643760 ]
ASF GitHub Bot commented on HADOOP-18546:
-----------------------------------------
pranavsaxena-microsoft commented on PR #5176:
URL: https://github.com/apache/hadoop/pull/5176#issuecomment-1339051008
> sorry, should have been clearer: a local spark build and spark-shell process is ideal for replication and validation - as all splits are processed in different worker threads in that process, it recreates the exact failure mode.
>
> a script you can take and tune for your system; it uses the mkcsv command in the cloudstore JAR.
>
> I am going to add this as a scalatest suite in the same module:
> https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
Thanks for the script. I applied the following changes to it:
https://github.com/pranavsaxena-microsoft/cloud-integration/commit/1d779f22150be3102635819e4525967573602dd9.
With trunk's jar, I got this exception:
```
22/12/05 23:51:27 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 5)
java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "rowId")
- root class: "$line85.$read.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.CsvRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_0_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1001)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1001)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2302)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1502)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
```
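For context on what the NullPointerException means: the script decodes each CSV row into a `CsvRecord` case class whose `rowId` field is a primitive `scala.Long`, so when a corrupted read leaves that column empty, Spark surfaces it as the NPE above rather than as a data error. A minimal sketch of the relevant part of the record type (only `start` and `rowId` are confirmed by the schema echoed in the successful run below; the real record in the script has more fields):
```scala
// Minimal sketch, not the script's full definition: the real CsvRecord in
// validating-csv-record-io.sc has several more fields ("... 6 more fields").
case class CsvRecord(
  start: String, // a reference type, so a null here would not throw
  rowId: Long    // primitive scala.Long cannot hold null -> the NPE above on a corrupted row
)
```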
Using the jar built from the PR's code, the validation completed successfully:
```
minimums=((action_http_get_request.min=-1) (action_http_get_request.failures.min=-1));
maximums=((action_http_get_request.max=-1) (action_http_get_request.failures.max=-1));
means=((action_http_get_request.failures.mean=(samples=0, sum=0, mean=0.0000)) (action_http_get_request.mean=(samples=0, sum=0, mean=0.0000)));
}}
22/12/06 01:04:22 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 9) in 14727 ms on snvijaya-Virtual-Machine.mshome.net (executor driver) (9/9)
22/12/06 01:04:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/12/06 01:04:22 INFO DAGScheduler: ResultStage 1 (foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46) finished in 115.333 s
22/12/06 01:04:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/06 01:04:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/12/06 01:04:22 INFO DAGScheduler: Job 1 finished: foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46, took 115.337621 s
res35: String = validation completed [start: string, rowId: bigint ... 6 more fields]
```
Commands executed:
```
:load /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
validateDS(rowsDS)
```
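For anyone reproducing this without pulling the whole script: `rowsDS` and `validateDS` are defined in validating-csv-record-io.sc. The sketch below is only an approximation of that flow, with a placeholder store path and a trimmed two-column schema, so treat the names and checks as assumptions rather than the script's actual contents:
```scala
// Approximate, self-contained sketch of the validation flow; the real logic
// lives in validating-csv-record-io.sc. Path, schema and check are placeholders.
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate() // spark-shell already provides `spark`
import spark.implicits._

case class CsvRecord(start: String, rowId: Long) // trimmed: the real record has more fields

val rowsDS: Dataset[CsvRecord] = spark.read
  .option("header", "true")
  .schema("start STRING, rowId BIGINT")
  .csv("abfs://container@account.dfs.core.windows.net/test/rows.csv") // placeholder path
  .as[CsvRecord]

def validateDS(ds: Dataset[CsvRecord]): String = {
  // foreach decodes every split in an executor thread, which is where a
  // corrupted ABFS prefetch shows up as a bad or null rowId
  ds.foreach(r => require(r.rowId >= 0, s"corrupt record: $r"))
  s"validation completed ${ds.schema.simpleString}"
}

validateDS(rowsDS)
```
The point of driving the check through `foreach` is that every split is read concurrently in its own worker thread, which is exactly the condition under which the closed-stream prefetch bug corrupts data.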
> disable purging list of in progress reads in abfs stream closed
> ---------------------------------------------------------------
>
> Key: HADOOP-18546
> URL: https://issues.apache.org/jira/browse/HADOOP-18546
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.3.4
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
>
> Turn off the pruning of in-progress reads in
> ReadBufferManager::purgeBuffersForStream.
> This will ensure that active prefetches for a closed stream complete. They
> will then get to the completed list and hang around until evicted by timeout,
> but at least prefetching will be safe.
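To make the quoted description concrete: ReadBufferManager is the read-ahead buffer pool shared by all ABFS input streams, and the change only affects what happens to its buffers when a stream is closed. The Scala below is a conceptual illustration of the intended behaviour, not the Hadoop Java code; apart from purgeBuffersForStream, the names are invented for the sketch:
```scala
// Conceptual illustration only: the real class is Java (ReadBufferManager in
// the hadoop-azure module); fields and types here are invented.
object BufferPoolSketch {
  final case class Prefetch(streamId: Long)

  private var completed: List[Prefetch] = Nil  // finished prefetches, evicted later by timeout
  private var inProgress: List[Prefetch] = Nil // network reads still writing into their buffers

  // Called when an input stream is closed.
  def purgeBuffersForStream(streamId: Long): Unit = synchronized {
    // Completed prefetches for the closed stream can be dropped straight away.
    completed = completed.filterNot(_.streamId == streamId)
    // HADOOP-18546: leave in-progress reads alone. They finish, move to the
    // completed list, and age out via the normal eviction timeout, so a buffer
    // is never recycled while a read is still writing into it.
  }
}
```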