[ https://issues.apache.org/jira/browse/HADOOP-18546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643760#comment-17643760 ]
ASF GitHub Bot commented on HADOOP-18546:
-----------------------------------------
pranavsaxena-microsoft commented on PR #5176:
URL: https://github.com/apache/hadoop/pull/5176#issuecomment-1339051008
> sorry, should have been clearer: a local spark build and spark-shell process is ideal for replication and validation - as all splits are processed in different worker threads in that process, it recreates the exact failure mode.
>
> a script you can take and tune for your system; it uses the mkcsv command in the cloudstore JAR.
>
> I am going to add this as a scalatest suite in the same module:
> https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
Thanks for the script. I applied the following changes to it:
https://github.com/pranavsaxena-microsoft/cloud-integration/commit/1d779f22150be3102635819e4525967573602dd9.
With trunk's jar, I got this exception:
```
22/12/05 23:51:27 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 5)
java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "rowId")
- root class: "$line85.$read.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.CsvRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_0_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1001)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1001)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2302)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1502)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
```
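For context on what the NullPointerException means: the script decodes each CSV row into a `CsvRecord` case class whose `rowId` field is a primitive `scala.Long`, so when a corrupted read leaves that column empty, Spark surfaces it as the NPE above rather than as a data error. A minimal sketch of the relevant part of the record type (only `start` and `rowId` are confirmed by the schema echoed in the successful run below; the real record in the script has more fields):
```scala
// Minimal sketch, not the script's full definition: the real CsvRecord in
// validating-csv-record-io.sc has several more fields ("... 6 more fields").
case class CsvRecord(
  start: String, // a reference type, so a null here would not throw
  rowId: Long    // primitive scala.Long cannot hold null -> the NPE above on a corrupted row
)
```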
Using the jar built from the PR's code, the validation completed successfully:
```
minimums=((action_http_get_request.min=-1) (action_http_get_request.failures.min=-1));
maximums=((action_http_get_request.max=-1) (action_http_get_request.failures.max=-1));
means=((action_http_get_request.failures.mean=(samples=0, sum=0, mean=0.0000)) (action_http_get_request.mean=(samples=0, sum=0, mean=0.0000)));
}}
22/12/06 01:04:22 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 9) in 14727 ms on snvijaya-Virtual-Machine.mshome.net (executor driver) (9/9)
22/12/06 01:04:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/12/06 01:04:22 INFO DAGScheduler: ResultStage 1 (foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46) finished in 115.333 s
22/12/06 01:04:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/06 01:04:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/12/06 01:04:22 INFO DAGScheduler: Job 1 finished: foreach at /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc:46, took 115.337621 s
res35: String = validation completed [start: string, rowId: bigint ... 6 more fields]
```
Commands executed:
```
:load /home/snvijaya/Desktop/cloud-integration/spark-cloud-integration/src/scripts/validating-csv-record-io.sc
validateDS(rowsDS)
```
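For anyone reproducing this without pulling the whole script: `rowsDS` and `validateDS` are defined in validating-csv-record-io.sc. The sketch below is only an approximation of that flow, with a placeholder store path and a trimmed two-column schema, so treat the names and checks as assumptions rather than the script's actual contents:
```scala
// Approximate, self-contained sketch of the validation flow; the real logic
// lives in validating-csv-record-io.sc. Path, schema and check are placeholders.
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate() // spark-shell already provides `spark`
import spark.implicits._

case class CsvRecord(start: String, rowId: Long) // trimmed: the real record has more fields

val rowsDS: Dataset[CsvRecord] = spark.read
  .option("header", "true")
  .schema("start STRING, rowId BIGINT")
  .csv("abfs://container@account.dfs.core.windows.net/test/rows.csv") // placeholder path
  .as[CsvRecord]

def validateDS(ds: Dataset[CsvRecord]): String = {
  // foreach decodes every split in an executor thread, which is where a
  // corrupted ABFS prefetch shows up as a bad or null rowId
  ds.foreach(r => require(r.rowId >= 0, s"corrupt record: $r"))
  s"validation completed ${ds.schema.simpleString}"
}

validateDS(rowsDS)
```
The point of driving the check through `foreach` is that every split is read concurrently in its own worker thread, which is exactly the condition under which the closed-stream prefetch bug corrupts data.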
> disable purging list of in progress reads in abfs stream closed
> ---------------------------------------------------------------
>
> Key: HADOOP-18546
> URL: https://issues.apache.org/jira/browse/HADOOP-18546
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.3.4
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
>
> Turn off the pruning of in-progress reads in
> ReadBufferManager::purgeBuffersForStream.
> This will ensure that active prefetches for a closed stream complete. They
> will then get to the completed list and hang around until evicted by timeout,
> but at least prefetching will be safe.
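To make the quoted description concrete: ReadBufferManager is the read-ahead buffer pool shared by all ABFS input streams, and the change only affects what happens to its buffers when a stream is closed. The Scala below is a conceptual illustration of the intended behaviour, not the Hadoop Java code; apart from purgeBuffersForStream, the names are invented for the sketch:
```scala
// Conceptual illustration only: the real class is Java (ReadBufferManager in
// the hadoop-azure module); fields and types here are invented.
object BufferPoolSketch {
  final case class Prefetch(streamId: Long)

  private var completed: List[Prefetch] = Nil  // finished prefetches, evicted later by timeout
  private var inProgress: List[Prefetch] = Nil // network reads still writing into their buffers

  // Called when an input stream is closed.
  def purgeBuffersForStream(streamId: Long): Unit = synchronized {
    // Completed prefetches for the closed stream can be dropped straight away.
    completed = completed.filterNot(_.streamId == streamId)
    // HADOOP-18546: leave in-progress reads alone. They finish, move to the
    // completed list, and age out via the normal eviction timeout, so a buffer
    // is never recycled while a read is still writing into it.
  }
}
```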