[
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871885#comment-16871885
]
Hyukjin Kwon commented on SPARK-27966:
--------------------------------------
While debugging this, I found one case not working:
{code}
from pyspark.sql.functions import udf, input_file_name
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"),
input_file_name()).show()
{code}
{code}
+------------+-----------------+
|<lambda>(id)|input_file_name()|
+------------+-----------------+
| 8| |
| 5| |
| 0| |
| 9| |
| 6| |
| 2| |
| 3| |
| 4| |
| 7| |
| 1| |
+------------+-----------------+
{code}
But the reproducer described here works. Is this the same issue or a different one,
[~Chr_96er]?
> input_file_name empty when listing files in parallel
> ----------------------------------------------------
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Reporter: Christian Homberg
> Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128:
> _org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.
>
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> | |
> | |
> | |
> | |
> | |
> +-----------------+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that
> the issue occurs when the files are listed in parallel, e.g. when
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories
> in parallel under:{code}
>
> Everything's fine as long as the listing stays below the threshold and runs sequentially:
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and
> directories. Size of Paths: 0; threshold: 32
> {code}
>
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to 9999
> resolves the issue for me.
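> A minimal sketch of that workaround (hedged: it assumes an existing SparkSession
> named {{spark}}, and 9999 is simply a value larger than the number of input paths,
> so leaf-file listing stays sequential on the driver instead of being parallelized):

```python
# Workaround sketch: raise the parallel partition discovery threshold
# above the number of input paths (default is 32), so InMemoryFileIndex
# lists leaf files sequentially on the driver rather than in parallel.
# Assumes `spark` is an already-created SparkSession.
spark.conf.set(
    "spark.sql.sources.parallelPartitionDiscovery.threshold", "9999"
)
```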
>
> *edit: the problem is not exclusively linked to listing files in parallel.
> I've set up a larger cluster for which, after parallel file listing, the
> input_file_name did return the correct filename. After inspecting the log4j
> output again, I assume it's linked to some kind of MetaStore being full. I've
> attached a section of the log4j output that I think should indicate why it's
> failing. If you need more, please let me know.*
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]