You might try the Spark 2.0 preview.  We spent a bunch of time improving
the handling of many small files.
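
If you do, note that the entry point changed in 2.0; roughly like this (a
sketch, untested against the preview, with an arbitrary app name):

    # Spark 2.0 preview: SparkSession replaces SQLContext as the entry point.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-load").getOrCreate()
    df = spark.read.json("hdfs:///user/hadoop/data/*/*")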

On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda <khaled.hammo...@kik.com>
wrote:

> I'm trying to use Spark SQL to load JSON data split across about 70k files
> in 24 directories in HDFS, using
> sqlContext.read.json("hdfs:///user/hadoop/data/*/*").
>
> This doesn't seem to work for some reason; I get timeout errors like the
> following:
>
> -------
> 16/06/13 15:46:31 ERROR TransportChannelHandler: Connection to
> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 has been quiet for
> 120000 ms while there are outstanding requests. Assuming connection is
> dead; please adjust spark.network.timeout if this is wrong.
> 16/06/13 15:46:31 ERROR TransportResponseHandler: Still have 1 requests
> outstanding when connection from
> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 is closed
> ...
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [120 seconds]
> -------
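>
> From what I read, those timeouts can be raised when building the context,
> along these lines (a sketch; the 600s value is a guess, not a tested
> recommendation):
>
>     from pyspark import SparkConf, SparkContext
>     from pyspark.sql import SQLContext
>
>     # Raise both timeouts the errors above point at.
>     conf = (SparkConf()
>             .set("spark.network.timeout", "600s")
>             .set("spark.rpc.askTimeout", "600s"))
>     sc = SparkContext(conf=conf)
>     sqlContext = SQLContext(sc)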
>
> That said, I don't want to start tinkering with increasing timeouts yet. I
> tried loading just one sub-directory, which contains around 4k files, and
> that works fine. So I thought of writing a loop where I load the JSON files
> from each sub-directory and unionAll the current DataFrame with the
> previous one. However, this also fails, because apparently the JSON files
> don't all have exactly the same schema, causing this error:
>
> ---
> Traceback (most recent call last):
>   File "/home/hadoop/load_json.py", line 65, in <module>
>     df = df.unionAll(hrdf)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 998, in unionAll
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
> pyspark.sql.utils.AnalysisException: u"unresolved operator 'Union;"
> ---
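>
> For what it's worth, the loop looks roughly like this, including a
> null-padding step I'm considering to align the top-level columns before
> the union (a sketch; the directory names are made up, and it wouldn't
> reconcile differences inside nested structs):
>
>     from pyspark.sql.functions import col, lit
>
>     paths = ["hdfs:///user/hadoop/data/dir%02d/*" % i for i in range(24)]
>     frames = [sqlContext.read.json(p) for p in paths]
>
>     # Union of all top-level column names seen across sub-directories.
>     all_cols = sorted(set(c for f in frames for c in f.columns))
>
>     def pad(f):
>         # Add missing columns as nulls so every frame has the same shape.
>         return f.select([col(c) if c in f.columns else lit(None).alias(c)
>                          for c in all_cols])
>
>     df = pad(frames[0])
>     for f in frames[1:]:
>         df = df.unionAll(pad(f))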
>
> I'd like to know: what's preventing Spark from loading 70k files the same
> way it loads 4k files?
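>
> One workaround I'm considering: infer the schema from a single
> sub-directory and pass it to the full load, so Spark can skip the schema
> inference pass over all ~70k files (a sketch; the sub-directory name is
> made up, and it assumes that directory exhibits all the fields I care
> about):
>
>     # Infer once from ~4k files, then reuse the schema for the full load.
>     sample_schema = sqlContext.read.json("hdfs:///user/hadoop/data/dir01/*").schema
>     df = sqlContext.read.json("hdfs:///user/hadoop/data/*/*", schema=sample_schema)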
>
> To give you some idea of my setup and data:
> - ~70k files across 24 directories in HDFS
> - Each directory contains 3k files on average
> - Cluster: 200-node EMR cluster; each node has 53 GB memory and 8 cores
> available to YARN
> - Spark 1.6.1
>
> Thanks.
>