[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718784#comment-16718784 ]
Gabor Somogyi commented on SPARK-25136: --------------------------------------- As I see from the logs you've chosen the first way but didn't set query name. The mentioned scenario generates the following checkpoint location: spark.conf.set("spark.sql.streaming.checkpointLocation...) + <random UUID>. Can you either set query name or use the option on writer? > unable to use HDFS checkpoint directories after driver restart > -------------------------------------------------------------- > > Key: SPARK-25136 > URL: https://issues.apache.org/jira/browse/SPARK-25136 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0 > Reporter: Robert Reid > Priority: Major > > I have been attempting to work around the issue first discussed at the end of > SPARK-20894. The problem that we are encountering is the inability of the > Spark Driver to process additional data after restart. Upon restart it > reprocesses batches from the initial run. But when processing the first batch > it crashes because of missing checkpoint files. > We have a structured streaming job running on a Spark Standalone cluster. We > restart the Spark Driver to renew the Kerberos token in use to access HDFS. > Excerpts of log lines from a run are included below. In short, when the new > driver starts it reports that it is going to use a new checkpoint directory > in HDFS. Workers report that they have loaded a StateStoreProviders that > matches the directory. But then the worker reports that it cannot read the > delta files. This causes the driver to crash. > The HDFS directories are present but empty. Further, the directory > permissions for the original and new checkpoint directories are the same. The > worker never crashes. > As mentioned in SPARK-20894, deleting the _spark_metadata directory makes > subsequent restarts succeed. > Here is a timeline of log records from a recent run. > A new run began at 00:29:21. These entries from a worker log look good. > {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of > HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > for update}} > {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 > for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > to file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}} > As the shutdown is occurring the worker reports > {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}} > The restart began at 00:39:38. > Driver log entries > {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = > e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = > b7ee0163-47db-4392-ab66-94d36ce63074]. Use > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040 > to store the query checkpoint.}} > {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, > 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading > delta file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > of HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: > > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > does not exist}} > {{Caused by: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}} > > Worker log entries > {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance > StateStoreProviderId(StateStoreId(hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state,0,0,default),b7ee0163-47db-4392-ab66-94d36ce63074) > is active}} > {{java.lang.IllegalStateException: Error reading delta file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > of HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: > > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > does not exist}} > {{Caused by: java.io.FileNotFoundException: File does not exist: > /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org