Hi Chetan,
You can create a static Parquet file, and when you
create a DataFrame you can pass the locations of both files, with the
option mergeSchema set to true. That way you will always get a DataFrame
back, even if the original file is not present.
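A minimal sketch of that read (the S3 paths here are hypothetical; the
static file exists only so the read always has at least one Parquet file
to resolve a schema from):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("MergeSchemaRead").getOrCreate()

    // Both paths are hypothetical stand-ins for your static file and real data.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("s3://my-bucket/static.parquet", "s3://my-bucket/data.parquet")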
Kuchekar, Nilesh
On Sat, May 9
Hi,
Is there a way we can customize the partitioner for a Dataset to be the
Hive hash partitioner rather than the Murmur3 partitioner?
Regards,
Kuchekar, Nilesh
Hi Gaurav,
You might want to look into the Lambda Architecture with Spark.
https://www.youtube.com/watch?v=xHa7pA94DbA
Regards,
Kuchekar, Nilesh
On Thu, May 18, 2017 at 8:58 PM, Gaurav1809 wrote:
> Hello gurus,
>
> How exactly does it work in real-world scenarios when I
Hi,
I am running a Spark job that saves the computed data (a massive amount)
to S3. On the Spark UI I see that some jobs are still active, but there is
no activity in the logs. Also, all the data has been written to S3 (I
verified each bucket; each has a _SUCCESS file).
Am I missing something?
Thanks.
Kuche
,(y,index))
now reduceByKey, and then sort the values within each key by that index.
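A rough sketch of what I mean (the sample data is made up to mirror your
example):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("OrderWithinKey"))

    // Made-up sample mirroring your RDD of (ID, value) pairs.
    val rdd = sc.parallelize(Seq(("ID2", 18159), ("ID1", 18159), ("ID2", 36318)))

    // Tag each value with its original position, reduce by key into one
    // sequence per ID, then sort each sequence by that index.
    val ordered = rdd
      .zipWithIndex()
      .map { case ((id, value), index) => (id, Seq((value, index))) }
      .reduceByKey(_ ++ _)
      .mapValues(_.sortBy(_._2).map(_._1))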
Thanks.
Kuchekar, Nilesh
On Tue, Jul 26, 2016 at 7:35 PM, janardhan shetty wrote:
> Let me provide step wise details:
>
> 1.
> I have an RDD = {
> (ID2,18159) - *element 1*
> (ID1,18159)
Stage tab of the Spark UI.
Kuchekar, Nilesh
On Tue, Jul 19, 2016 at 8:16 PM, Aaron Jackson wrote:
> Hi,
>
> I have a cluster with 15 nodes of which 5 are HDFS nodes. I kick off a
> job that creates some 120 stages. Eventually, the active and pending
> stages reduce down to a small
tune spark
<http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/>,
cheatsheet for tuning spark <http://techsuppdiva.github.io/spark1.6.html>.
Hope this helps. If it does, keep the community posted on what resolved
your issue.
Thanks.
Kuchekar, Nilesh
On Sat, Feb
are setting.
Kuchekar, Nilesh
On Wed, Feb 17, 2016 at 8:02 PM, wrote:
> Hi All,
>
> I have been facing memory issues in Spark. I'm using Spark SQL on AWS EMR.
> I have an ~50 GB file in AWS S3. I want to read this file in a BI tool
> connected to Spark SQL on the Thrift server over O
ad","4000")
conf = conf.set("spark.executor.cores", "4")
           .set("spark.executor.memory", "15G")
           .set("spark.executor.instances", "6")
Is it also possible to use reduceByKey in place of groupByKey? That might
help with the shuffling too.
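For instance, if the job is summing per key (a hedged sketch; the data and
names are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ReduceVsGroup"))

    // Hypothetical (key, value) pairs standing in for your data.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey ships every value across the network before aggregating.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so far less data is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _)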
K