Let me share the IPython notebook.
On Tue, Jun 30, 2020 at 11:18 AM Gourav Sengupta
wrote:
> Hi,
>
> I think that the notebook clearly demonstrates that setting the
> inferTimestamp option to False does not really help.
>
> Is it really impossible for you to show how your own data can be loaded?
Hi,
I think that the notebook clearly demonstrates that setting the
inferTimestamp option to False does not really help.
Is it really impossible for you to show how your own data can be loaded? It
should be simple: just open the notebook and see why the exact code you
have given does not work.
Hi Gourav,
Please check the comments on the ticket. It looks like the performance
degradation is attributable to the inferTimestamp option, which is true by
default (I have no idea why) in Spark 3.0. This forces Spark to scan the
entire text during schema inference, hence the poor performance.
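For reference, the option in question is passed straight on the DataFrameReader; whether setting it to false actually restores the 2.4 timings is exactly what is being debated in this thread. A spark-shell sketch (path reused from the thread as a placeholder):

```scala
// spark-shell, Spark 3.0. inferTimestamp = false tells JSON schema
// inference not to probe every string value against timestamp patterns.
spark.time(
  spark.read
    .option("inferTimestamp", "false")
    .json("/data/small-anon/") // placeholder path from this thread
)
```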
Regards
Sanjeev
Hi, Sanjeev,
I think that I did precisely that. Can you please download my IPython
notebook, have a look, and let me know where I am going wrong? It is
attached to the JIRA ticket.
Regards,
Gourav Sengupta
On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra
wrote:
There are 11 files in total in the tar archive. You will have to untar it to
get to the actual files (.json.gz).
No, I am getting
Count: 33447
spark.time(spark.read.json("/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more
fields]
scala>
Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA
ticket, is that correct?
Regards,
Gourav Sengupta
On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra
wrote:
> There is not much code, I am just using spark-shell and reading the data
> like so
>
> spark.time(spark.read.json("/data/small-anon/"))
There is not much code; I am just using spark-shell and reading the data like so:
spark.time(spark.read.json("/data/small-anon/"))
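If inference is the suspect, it can be taken out of the picture entirely by supplying a schema up front; a sketch, using only the two fields visible in the DataFrame printouts in this thread (the remaining fields are not shown here, so this is illustrative only):

```scala
// spark-shell: with an explicit schema Spark does not scan the data to
// infer types, so read setup no longer depends on the inference path.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("created", LongType),  // fields taken from the printed
  StructField("id", StringType)))    // schema; "... 2 more fields" omitted

spark.time(spark.read.schema(schema).json("/data/small-anon/"))
```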
> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta
> wrote:
>
> Hi Sanjeev,
>
> can you share the exact code that you are using to read the JSON files?
Done. https://issues.apache.org/jira/browse/SPARK-32130
On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk
wrote:
> Hello Sanjeev,
>
> It is hard to troubleshoot the issue without input files. Could you open
> a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and
> attach the JSON files there (or samples, or code which generates the JSON files)?
Hello Sanjeev,
It is hard to troubleshoot the issue without input files. Could you open a
JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the
JSON files there (or samples, or code which generates the JSON files)?
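In case it helps, a sample in the right shape can be generated without sharing real data. A hypothetical generator (plain JVM code, no Spark needed; field names taken from the schema printed elsewhere in the thread, values made up):

```scala
// Writes n gzip-compressed JSON lines shaped like the thread's schema
// (created: bigint, id: string). The epoch base value is arbitrary.
import java.io.{FileOutputStream, OutputStreamWriter}
import java.util.zip.GZIPOutputStream

object SampleJsonGz {
  def render(i: Int): String =
    s"""{"created": ${1590000000L + i}, "id": "rec-$i"}"""

  def write(path: String, n: Int): Unit = {
    val out = new OutputStreamWriter(
      new GZIPOutputStream(new FileOutputStream(path)), "UTF-8")
    try (0 until n).foreach(i => out.write(render(i) + "\n"))
    finally out.close()
  }
}
```

A handful of files written this way into one directory should reproduce the multi-file .json.gz layout described above.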
Maxim Gekk
Software Engineer
Databricks, Inc.
It has read everything. As you can notice, the timing of the count is still
smaller in Spark 2.4.
Spark 2.4
scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5
more fields]
scala> spark.time(res61.count())
Could you share your code? Are you sure your Spark 2.4 cluster had
indeed read anything? It looks like the Input size field is empty under 2.4.
-- ND
On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
I have a large number of JSON files that Spark can read in 36 seconds
but Spark 3.0 takes almost 33 minutes to read the same.
There is not much code; I am using the spark-shell provided by Spark 2.4 and
Spark 3.0.
val dp = spark.read.json("/Users//data/dailyparams/20200528")
On Mon, Jun 29, 2020 at 2:25 AM Gourav Sengupta
wrote:
> Hi,
>
> can you please share the SPARK code?
>
>
>
> Regards,
> Gourav
>
Hi,
can you please share the SPARK code?
Regards,
Gourav
On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra
wrote:
>
> I have a large number of JSON files that Spark can read in 36 seconds but
> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
> looks like Spark 3.0 is choosing a different DAG than Spark 2.0.
I have a large number of JSON files that Spark can read in 36 seconds but
Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
looks like Spark 3.0 is choosing a different DAG than Spark 2.0. Does anyone
have any idea what is going on? Is there any configuration problem with
Spark 3.0?
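One way to see the DAG difference concretely is to print the plans from both shells and diff them; a sketch:

```scala
// Run in the 2.4 shell and the 3.0 shell, then compare the output.
val df = spark.read.json("/data/20200528")
df.explain(true) // parsed, analyzed, optimized, and physical plans
```

The schema-inference pass itself also shows up as a separate job in the Spark UI, which is another place the two versions can be compared.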