The best case is to use a dataframe: df.columns will automatically give you
the column names. Are you sure your file is really csv? Maybe it would be
easier if you shared the code?
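For illustration only, roughly what I have in mind (the path and the session
setup are placeholders, not taken from your code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-columns").getOrCreate()

    # header=True tells Spark to take the column names from the first line,
    # so df.columns already contains them without any rdd gymnastics
    df = spark.read.csv("/path/to/my_file.csv", header=True, inferSchema=True)
    print(df.columns)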
On Wed, 24 Mar 2021 at 2:12 pm, Sean Owen wrote:
> It would split 10GB of CSV into multiple partitions by default, unless it's
> gzipped.
It would split 10GB of CSV into multiple partitions by default, unless it's
gzipped. Something else is going on here.
On Tue, Mar 23, 2021 at 10:04 PM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:
> I'm not a Spark core developer and do not want to confuse you but it seems …
I'm not a Spark core developer and do not want to confuse you, but it seems
logical to me that just reading from a single file (no matter what format the
file is in) gives no parallelism unless you do a repartition by some column
just after the csv load, but if you're telling me you've already tried
I don't think that would change the partitioning? Try .repartition(). It isn't
necessary to write it out, let alone in Avro.
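Something along these lines, as a sketch (the path and the partition count
are just example values):

    df = spark.read.csv("/path/to/input.csv", header=True)

    # redistribute the rows before doing any expensive work;
    # 64 is an arbitrary example, pick a value based on your cluster
    df = df.repartition(64)
    print(df.rdd.getNumPartitions())   # confirm the new partition count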
On Tue, Mar 23, 2021 at 8:45 PM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:
> Hi, Mohammed
> I think that the reason that only one executor is running
So Spark by default doesn't split the large 10GB file when it is loaded?
Sent from my iPhone
> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) wrote:
>
> Hi, Mohammed
> I think that the reason that only one executor is running and you have a
> single partition is because you have si
Hi, Mohammed
I think that the reason that only one executor is running and you have a
single partition is because you have a single file that might be read/loaded
into memory.
In order to achieve better parallelism I'd suggest splitting the csv file.
Another question: why are you using rdd?
Hi,
I have a 10GB file that should be loaded into a Spark dataframe. This file is
csv with a header, and we were using rdd.zipWithIndex to get the column names
and convert to avro accordingly.
I am assuming this is why it is taking a long time: only one executor runs and
it never achieves parallelism. Is there an easy w
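A minimal sketch of the DataFrame route suggested in the replies above (the
paths and the spark-avro package version are assumptions, not from the
original post):

    # requires spark-avro on the classpath, e.g.
    #   --packages org.apache.spark:spark-avro_2.12:3.0.1
    df = (spark.read
          .option("header", True)        # column names come from the csv header
          .option("inferSchema", True)
          .csv("/path/to/10gb_input.csv"))

    # repartition for parallelism, then write out as Avro
    df.repartition(64).write.format("avro").save("/path/to/avro_output")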
Hi,
I just posted some stuff regarding using Spark with Oracle. If you want to do
distributed processing with any DW of your choice, be it Oracle, Hive or
BigQuery, it is best in my experience to create Spark dataframes on top of the
underlying storage, either through JDBC or the Spark API (Hive or BigQuery).
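As a hedged sketch of the JDBC route (the URL, credentials and view name below
are placeholders, and the Oracle JDBC driver jar must be on the classpath):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
          .option("dbtable", "sales_schema.sales_costs_view")
          .option("user", "scott")
          .option("password", "tiger")
          .option("fetchsize", 10000)
          .option("driver", "oracle.jdbc.OracleDriver")
          .load())
    df.printSchema()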
Hi,
I did some investigation on this and created a dataframe on top of the
underlying view in the Oracle database.
Let us assume that our Oracle view is just a normal view as opposed to a
materialized view, something like below, where both sales and costs are FACT
tables:
CREATE OR REPLACE FORCE EDITIONABL
Hi Team,
I am facing this issue again.
I am using Spark 3.0.1 with Python.
Could you please suggest why it says the below error:
Current Committed Offsets: {KafkaV2[Subscribe[my-topic]]: {"my-topic":{"1":1498,"0":1410}}}
Current Available Offsets: {KafkaV2[Subscribe[my-topic]]: {"my-topic":{"1
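In Structured Streaming the committed offsets come from the query's checkpoint
and the available offsets from the broker, so presumably the job looks roughly
like this (brokers, checkpoint path and output sink are placeholders, purely
for illustration):

    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "my-topic")
          .load())

    query = (df.writeStream.format("console")
             .option("checkpointLocation", "/tmp/checkpoints/my-topic")
             .start())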
Hi Team,
I am looking to learn Apache Spark and to do the certification.
I am new to the Spark framework. Kindly help with guidelines and complete
details on how to proceed.
Thanks
Kishore Kumar
I have been developing 'Spark on Oracle', a project to provide better
integration of Spark into an Oracle Data Warehouse. You can read about it
at https://hbutani.github.io/spark-on-oracle/blog/Spark_on_Oracle_Blog.html
The key features are Catalog Integration, and translation and pushdown of
Spark SQL
Hey!
I don't think you can do selective removals; never heard of it, but who
knows..
You can refer here to see all the available options ->
https://spark.apache.org/docs/latest/monitoring.html .
In my experience having 4 days' worth of logs is enough; usually if
something fails you check it righ
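For example, the automatic clean-up described on that page is driven by the
history server cleaner settings; something along these lines in
spark-defaults.conf (the 4d value just mirrors the retention mentioned above,
not a recommendation):

    spark.history.fs.cleaner.enabled    true
    spark.history.fs.cleaner.interval   1d   # how often the cleaner runs
    spark.history.fs.cleaner.maxAge     4d   # event logs older than this are deleted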