Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich,

Thank you. Ah, I want to avoid bringing all data to the driver node. That
is my understanding of what will happen in that case. Perhaps, I'll trigger
a Lambda to rename/combine the files after PySpark writes them.
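
(For the archive, a minimal sketch of that post-write rename step, assuming
boto3 and hypothetical bucket/prefix/target names; S3 has no real rename, so
it is a copy followed by a delete:)

```
import boto3

s3 = boto3.client("s3")

def rename_single_part(bucket, prefix, target_key):
    # Assumes exactly one part-* object under `prefix` (e.g. after coalesce(1)).
    # S3 cannot rename in place, so copy the object to `target_key`, then delete it.
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    part_keys = [o["Key"] for o in listing.get("Contents", [])
                 if o["Key"].rsplit("/", 1)[-1].startswith("part-")]
    assert len(part_keys) == 1, f"expected one part file, found {part_keys}"
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": part_keys[0]},
                   Key=target_key)
    s3.delete_object(Bucket=bucket, Key=part_keys[0])

# hypothetical invocation, e.g. from a Lambda triggered on the _SUCCESS marker:
# rename_single_part("bucket_name", "test_folder/year=2023/month=5/",
#                    "test_folder/year=2023/month=5/orders.json")
```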

Cheers,
Marco.

On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh 
wrote:

> you can try
>
> df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")
>
> hdfs dfs -ls /tmp/pairs.json
> Found 2 items
> -rw-r--r--   3 hduser supergroup  0 2023-05-04 22:21
> /tmp/pairs.json/_SUCCESS
> -rw-r--r--   3 hduser supergroup 96 2023-05-04 22:21
> /tmp/pairs.json/part-0-21f12540-c1c6-441d-a9b2-a82ce2113853-c000.json
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 4 May 2023 at 22:14, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> Hi Mich,
>> Thank you.
>> Are you saying this satisfies my requirement?
>>
>> On the other hand, I am smelling something going on. Perhaps the Spark
>> 'part' files should not be thought of as files, but rather pieces of a
>> conceptual file. If that is true, then your approach (of which I'm well
>> aware) makes sense. Question: what are some good methods, tools, for
>> combining the parts into a single, well-named file? I imagine that is
>> outside of the scope of PySpark, but any advice is welcome.
>>
>> Thank you,
>> Marco.
>>
>> On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh 
>> wrote:
>>
>>> AWS S3 and Google GCS are Hadoop-compatible file systems (HCFS), so output
>>> is sharded to improve read performance when writing to HCFS file systems.
>>>
>>> Let us take your code for a drive
>>>
>>> import findspark
>>> findspark.init()
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import struct
>>> from pyspark.sql.types import *
>>> spark = SparkSession.builder \
>>> .getOrCreate()
>>> pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
>>> Schema = StructType([ StructField("ID", IntegerType(), False),
>>>   StructField("datA" , StringType(), True)])
>>> df = spark.createDataFrame(data=pairs,schema=Schema)
>>> df.printSchema()
>>> df.show()
>>> df2 = df.select(df.ID.alias("ID"), struct(df.datA).alias("Struct"))
>>> df2.printSchema()
>>> df2.show()
>>> df2.write.mode("overwrite").json("/tmp/pairs.json")
>>>
>>> root
>>>  |-- ID: integer (nullable = false)
>>>  |-- datA: string (nullable = true)
>>>
>>> +---+----+
>>> | ID|datA|
>>> +---+----+
>>> |  1|  a1|
>>> |  2|  a2|
>>> |  3|  a3|
>>> +---+----+
>>>
>>> root
>>>  |-- ID: integer (nullable = false)
>>>  |-- Struct: struct (nullable = false)
>>>  ||-- datA: string (nullable = true)
>>>
>>> +---+------+
>>> | ID|Struct|
>>> +---+------+
>>> |  1|  {a1}|
>>> |  2|  {a2}|
>>> |  3|  {a3}|
>>> +---+------+
>>>
>>> Look at the last line where json format is written
>>> df2.write.mode("overwrite").json("/tmp/pairs.json")
>>> Under the bonnet this happens
>>>
>>> hdfs dfs -ls /tmp/pairs.json
>>> Found 5 items
>>> -rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
>>> /tmp/pairs.json/_SUCCESS
>>> -rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
>>> /tmp/pairs.json/part-0-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>>> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
>>> /tmp/pairs.json/part-1-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>>> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
>>> /tmp/pairs.json/part-2-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>>> -rw-r--r--   3 hduser supe

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hi Mich,
Thank you.
Are you saying this satisfies my requirement?

On the other hand, I am smelling something going on. Perhaps the Spark
'part' files should not be thought of as files, but rather pieces of a
conceptual file. If that is true, then your approach (of which I'm well
aware) makes sense. Question: what are some good methods, tools, for
combining the parts into a single, well-named file? I imagine that is
outside of the scope of PySpark, but any advice is welcome.
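
(One approach that stays inside PySpark, sketched here only as an assumption:
write with coalesce(1), as suggested later in this thread, then rename the
single part file through the JVM Hadoop FileSystem API that py4j exposes.
Note the underscored handles below are not public API.)

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "ID int, datA string")

# coalesce(1) funnels everything through one task, so only for modest volumes
df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")

# rename the lone part-* file via Hadoop's FileSystem (HDFS or any other HCFS)
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path("/tmp/pairs.json").getFileSystem(spark._jsc.hadoopConfiguration())
for status in fs.globStatus(Path("/tmp/pairs.json/part-*")):
    fs.rename(status.getPath(), Path("/tmp/pairs.json/pairs-combined.json"))
```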

Thank you,
Marco.

On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh 
wrote:

> AWS S3 and Google GCS are Hadoop-compatible file systems (HCFS), so output
> is sharded to improve read performance when writing to HCFS file systems.
>
> Let us take your code for a drive
>
> import findspark
> findspark.init()
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import struct
> from pyspark.sql.types import *
> spark = SparkSession.builder \
> .getOrCreate()
> pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
> Schema = StructType([ StructField("ID", IntegerType(), False),
>   StructField("datA" , StringType(), True)])
> df = spark.createDataFrame(data=pairs,schema=Schema)
> df.printSchema()
> df.show()
> df2 = df.select(df.ID.alias("ID"), struct(df.datA).alias("Struct"))
> df2.printSchema()
> df2.show()
> df2.write.mode("overwrite").json("/tmp/pairs.json")
>
> root
>  |-- ID: integer (nullable = false)
>  |-- datA: string (nullable = true)
>
> +---+----+
> | ID|datA|
> +---+----+
> |  1|  a1|
> |  2|  a2|
> |  3|  a3|
> +---+----+
>
> root
>  |-- ID: integer (nullable = false)
>  |-- Struct: struct (nullable = false)
>  ||-- datA: string (nullable = true)
>
> +---+------+
> | ID|Struct|
> +---+------+
> |  1|  {a1}|
> |  2|  {a2}|
> |  3|  {a3}|
> +---+------+
>
> Look at the last line where json format is written
> df2.write.mode("overwrite").json("/tmp/pairs.json")
> Under the bonnet this happens
>
> hdfs dfs -ls /tmp/pairs.json
> Found 5 items
> -rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
> /tmp/pairs.json/_SUCCESS
> -rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
> /tmp/pairs.json/part-0-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
> /tmp/pairs.json/part-1-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
> /tmp/pairs.json/part-2-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
> /tmp/pairs.json/part-3-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 4 May 2023 at 21:38, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> Hello,
>>
>> I am testing writing my DataFrame to S3 using the DataFrame `write`
>> method. It mostly does a great job. However, it fails one of my
>> requirements. Here are my requirements.
>>
>> - Write to S3
>> - use `partitionBy` to automatically make folders based on my chosen
>> partition columns
>> - control the resultant filename (whole or in part)
>>
>> I can get the first two requirements met but not the third.
>>
>> Here's an example. When I use the commands...
>>
>> df.write.partitionBy("year","month").mode("append")\
>> .json('s3a://bucket_name/test_folder/')
>>
>> ... I get the partitions I need. However, the filenames are something
>> like: part-0-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json
>>
>>
>> Now, I understand Spark's need to include the partition number in the
>> filename. However, it sure would be nice to control the rest of the file
>> name.
>>
>>
>> Any advice? Please and thank you.
>>
>> Marco.
>>
>


Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hello,

I am testing writing my DataFrame to S3 using the DataFrame `write` method.
It mostly does a great job. However, it fails one of my requirements. Here
are my requirements.

- Write to S3
- use `partitionBy` to automatically make folders based on my chosen
partition columns
- control the resultant filename (whole or in part)

I can get the first two requirements met but not the third.

Here's an example. When I use the commands...

df.write.partitionBy("year","month").mode("append")\
.json('s3a://bucket_name/test_folder/')

... I get the partitions I need. However, the filenames are something
like: part-0-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json


Now, I understand Spark's need to include the partition number in the
filename. However, it sure would be nice to control the rest of the file
name.
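
(Spark's DataFrameWriter does not expose the part file name, so the usual
pattern is a post-write rename. A minimal sketch with boto3, assuming the s3a
path above and a purely hypothetical naming scheme; on S3 a rename is a copy
plus a delete:)

```
import boto3

s3 = boto3.client("s3")
bucket = "bucket_name"      # as in the s3a:// path above
root = "test_folder/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=root):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        folder, _, name = key.rpartition("/")
        if not name.startswith("part-"):
            continue  # leave _SUCCESS and anything else alone
        # keep the year=/month= folders, just swap in a chosen file name
        new_key = f"{folder}/orders-{name.split('-')[1]}.json"
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)
```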


Any advice? Please and thank you.

Marco.


Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico,
What a great answer. Thank you. Seems like I need to get comfortable with
the 'struct' and then I will be golden. Thank you again, friend.

Marco.

On Thu, May 4, 2023 at 3:00 AM Enrico Minack  wrote:

> Hi,
>
> You could rearrange the DataFrame so that writing the DataFrame as-is
> produces your structure:
>
> df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int,
> datA string")
> +---+----+
> | id|datA|
> +---+----+
> |  1|  a1|
> |  2|  a2|
> |  3|  a3|
> +---+----+
>
> df2 = df.select(df.id, struct(df.datA).alias("stuff"))
> root
>   |-- id: integer (nullable = true)
>   |-- stuff: struct (nullable = false)
>   ||-- datA: string (nullable = true)
> +---+-----+
> | id|stuff|
> +---+-----+
> |  1| {a1}|
> |  2| {a2}|
> |  3| {a3}|
> +---+-----+
>
> df2.write.json("data.json")
> {"id":1,"stuff":{"datA":"a1"}}
> {"id":2,"stuff":{"datA":"a2"}}
> {"id":3,"stuff":{"datA":"a3"}}
>
> Looks pretty much like what you described.
>
> Enrico
>
>
> Am 04.05.23 um 06:37 schrieb Marco Costantini:
> > Hello,
> >
> > Let's say I have a very simple DataFrame, as below.
> >
> > +---+----+
> > | id|datA|
> > +---+----+
> > |  1|  a1|
> > |  2|  a2|
> > |  3|  a3|
> > +---+----+
> >
> > Let's say I have a requirement to write this to a bizarre JSON
> > structure. For example:
> >
> > {
> >   "id": 1,
> >   "stuff": {
> > "datA": "a1"
> >   }
> > }
> >
> > How can I achieve this with PySpark? I have only seen the following:
> > - writing the DataFrame as-is (doesn't meet requirement)
> > - using a UDF (seems frowned upon)
> >
> > What I have tried is to do this within a `foreach`. I have had some
> > success, but also some problems with other requirements (serializing
> > other things).
> >
> > Any advice? Please and thank you,
> > Marco.
>
>
>


Write custom JSON from DataFrame in PySpark

2023-05-03 Thread Marco Costantini
Hello,

Let's say I have a very simple DataFrame, as below.

+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

Let's say I have a requirement to write this to a bizarre JSON structure.
For example:

{
  "id": 1,
  "stuff": {
"datA": "a1"
  }
}

How can I achieve this with PySpark? I have only seen the following:
- writing the DataFrame as-is (doesn't meet requirement)
- using a UDF (seems frowned upon)

What I have tried is to do this within a `foreach`. I have had some
success, but also some problems with other requirements (serializing other
things).
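
(A minimal sketch of the struct-based rearrangement that was suggested later
in this thread, assuming the two-column DataFrame above; each row then
serializes to the nested shape without a UDF:)

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")

# nest datA under a 'stuff' struct so each line becomes {"id":..., "stuff":{"datA":...}}
df2 = df.select("id", struct("datA").alias("stuff"))
df2.write.mode("overwrite").json("/tmp/custom.json")
```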

Any advice? Please and thank you,
Marco.


Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
Thanks team,
Email was just an example. The point was to illustrate that some actions
could be chained using Spark's foreach. In reality, this is an S3 write and
a Kafka message production, which I think is quite reasonable for Spark to
do.
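
(A minimal sketch of that chaining, using foreachPartition rather than foreach
so that one producer is created per partition instead of per row; it assumes
the kafka-python package and hypothetical broker/topic names:)

```
import json
from kafka import KafkaProducer  # kafka-python, assumed installed on the workers

def send_statements(df, topic="user-statements", brokers="broker:9092"):
    # Send one message per row (one statement per user); names are hypothetical.
    def send_partition(rows):
        producer = KafkaProducer(
            bootstrap_servers=brokers,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        for row in rows:
            producer.send(topic, row.asDict())
        producer.flush()
        producer.close()
    df.foreachPartition(send_partition)
```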

To answer Ayan's first question: yes, all of a user's orders, prepared for each
and every user.

Other than the remarks that email transmission is unwise (which, as I've now
noted, is irrelevant here), I am not seeing an alternative to using Spark's
foreach. Unless your proposal is for the Spark job to target 1 user, and
just run the job 1000's of times taking the user_id as input. That doesn't
sound attractive.

Also, while we say that foreach is not optimal, I cannot find any evidence
of it; neither here nor online. If there are any docs about the inner
workings of this functionality, please pass them to me. I continue to
search for them. Even late last night!

Thanks for your help team,
Marco.

On Wed, Apr 26, 2023 at 6:21 AM Mich Talebzadeh 
wrote:

> Indeed, very valid points by Ayan. How is email going to handle 1000s of
> records? As a solution architect I tend to replace users by customers, and
> for each order there must be products, a many-to-many relationship. If
> I were a customer I would also be interested in the product details as
> well. Sending via email sounds like a Jurassic Park solution.
>
> On Wed, 26 Apr 2023 at 10:24, ayan guha  wrote:
>
>> Adding to what Mitch said,
>>
>> 1. Are you trying to send statements of all orders to all users? Or the
>> latest order only?
>>
>> 2. Sending email is not a good use of Spark. Instead, I suggest using a
>> notification service or function. Spark should write to a queue (Kafka,
>> SQS... pick your choice here).
>>
>> Best regards
>> Ayan
>>
>> On Wed, 26 Apr 2023 at 7:01 pm, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well, OK. In a nutshell you want the result set for every user prepared
>>> and emailed to that user, right?
>>>
>>> This is a form of ETL where those result sets need to be posted
>>> somewhere. Say you create a table based on the result set prepared for each
>>> user. You may have many raw target tables at the end of the first ETL. How
>>> does this differ from using forEach? Performance-wise, forEach may not be
>>> optimal.
>>>
>>> Can you take the sample tables and try your method?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 26 Apr 2023 at 04:10, Marco Costantini <
>>> marco.costant...@rocketfncl.com> wrote:
>>>
>>>> Hi Mich,
>>>> First, thank you for that. Great effort put into helping.
>>>>
>>>> Second, I don't think this tackles the technical challenge here. I
>>>> understand the windowing as it serves those ranks you created, but I don't
>>>> see how the ranks contribute to the solution.
>>>> Third, the core of the challenge is about performing this kind of
>>>> 'statement' but for all users. In this example we target Mich, but that
>>>> reduces the complexity by a lot! In fact, a simple join and filter would
>>>> solve that one.
>>>>
>>>> Any thoughts on that? For me, the foreach is desirable because I can
>>>> have the workers chain other actions to each iteration (send email, send
>>>> HTTP request, etc).
>>>>
>>>> Thanks Mich,
>>>> Marco.
>>>>
>>>> On Tue, Apr 25, 2023 at 6:06 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Marco,
>>>>>
>>>>> First thoughts.
>>>>>
>>>>> foreach() is an action operation that iterates/loops over each
>>>>> element in the dataset, meaning cursor based. Tha

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
h order|106.11|   106.11|
> |Mich|   50007| Mich's 7th order|107.11|   107.11|
> |Mich|   50008| Mich's 8th order|108.11|   108.11|
> |Mich|   50009| Mich's 9th order|109.11|   109.11|
> |Mich|   50010|Mich's 10th order|210.11|   210.11|
> +----+--------+-----------------+------+---------+
>
> You can start on this.  Happy coding
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 25 Apr 2023 at 18:50, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> Thanks Mich,
>>
>> Great idea. I have done it. Those files are attached. I'm interested to
>> know your thoughts. Let's imagine this same structure, but with huge
>> amounts of data as well.
>>
>> Please and thank you,
>> Marco.
>>
>> On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Marco,
>>>
>>> Let us start simple,
>>>
>>> Provide a csv file of 5 rows for the users table. Each row has a unique
>>> user_id and one or two other columns like fictitious email etc.
>>>
>>> Also for each user_id, provide 10 rows of orders table, meaning that
>>> orders table has 5 x 10 rows for each user_id.
>>>
>>> both as comma separated csv file
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 25 Apr 2023 at 14:07, Marco Costantini <
>>> marco.costant...@rocketfncl.com> wrote:
>>>
>>>> Thanks Mich,
>>>> I have not but I will certainly read up on this today.
>>>>
>>>> To your point that all of the essential data is in the 'orders' table;
>>>> I agree! That distills the problem nicely. Yet, I still have some questions
>>>> on which someone may be able to shed some light.
>>>>
>>>> 1) If my 'orders' table is very large, and will need to be aggregated
>>>> by 'user_id', how will Spark intelligently optimize on that constraint
>>>> (only read data for relevent 'user_id's). Is that something I have to
>>>> instruct Spark to do?
>>>>
>>>> 2) Without #1, even with windowing, am I asking each partition to
>>>> search too much?
>>>>
>>>> Please, if you have any links to documentation I can read on *how*
>>>> Spark works under the hood for these operations, I would appreciate it if
>>>> you give them. Spark has become a pillar on my team and knowing it in more
>>>> detail is warranted.
>>>>
>>>> Slightly pivoting the subject here; I have tried something. It was a
>>>> suggestion by an AI chat bot and it seemed reasonable. In my main Spark
>>>> script I now have the line:
>>>>
>>>> ```
>>>> grouped_orders_df =
>>>> orders_df.groupBy('user_id').agg(collect_list(to_json(struct('user_id',
>>>> 'timestamp', 'total', 'description'))).alias('orders'))
>>>> ```
>>>> (json is ultimately needed)
>>>>
>>>> This actually achieves my goal by putting all of the 'orders' in a
>>>> single Array column. Now my worry is, will this column become too large if
>>>> there are a great ma

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich,

Great idea. I have done it. Those files are attached. I'm interested to
know your thoughts. Let's imagine this same structure, but with huge
amounts of data as well.

Please and thank you,
Marco.

On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh 
wrote:

> Hi Marco,
>
> Let us start simple,
>
> Provide a csv file of 5 rows for the users table. Each row has a unique
> user_id and one or two other columns like fictitious email etc.
>
> Also for each user_id, provide 10 rows of orders table, meaning that
> orders table has 5 x 10 rows for each user_id.
>
> both as comma separated csv file
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 25 Apr 2023 at 14:07, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> Thanks Mich,
>> I have not but I will certainly read up on this today.
>>
>> To your point that all of the essential data is in the 'orders' table; I
>> agree! That distills the problem nicely. Yet, I still have some questions
>> on which someone may be able to shed some light.
>>
>> 1) If my 'orders' table is very large, and will need to be aggregated by
>> 'user_id', how will Spark intelligently optimize on that constraint (only
>> read data for relevant 'user_id's). Is that something I have to instruct
>> Spark to do?
>>
>> 2) Without #1, even with windowing, am I asking each partition to search
>> too much?
>>
>> Please, if you have any links to documentation I can read on *how* Spark
>> works under the hood for these operations, I would appreciate it if you
>> give them. Spark has become a pillar on my team and knowing it in more
>> detail is warranted.
>>
>> Slightly pivoting the subject here; I have tried something. It was a
>> suggestion by an AI chat bot and it seemed reasonable. In my main Spark
>> script I now have the line:
>>
>> ```
>> grouped_orders_df =
>> orders_df.groupBy('user_id').agg(collect_list(to_json(struct('user_id',
>> 'timestamp', 'total', 'description'))).alias('orders'))
>> ```
>> (json is ultimately needed)
>>
>> This actually achieves my goal by putting all of the 'orders' in a single
>> Array column. Now my worry is, will this column become too large if there
>> are a great many orders. Is there a limit? I have searched for documentation
>> on such a limit but could not find any.
>>
>> I truly appreciate your help Mich and team,
>> Marco.
>>
>>
>> On Tue, Apr 25, 2023 at 5:40 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Have you thought of using  windowing function
>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>> achieve this?
>>>
>>> Effectively all your information is in the orders table.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
>>> marco.costant...@rocketfncl.com> wrote:
>>>
>>>> I have two tables: {users, orders}. In this example, let's say that for
>>>> each 1 User in the users table, there are 10 Orders in the orders 
>>>> table.
>>>>
>>

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich,
I have not but I will certainly read up on this today.

To your point that all of the essential data is in the 'orders' table; I
agree! That distills the problem nicely. Yet, I still have some questions
on which someone may be able to shed some light.

1) If my 'orders' table is very large, and will need to be aggregated by
'user_id', how will Spark intelligently optimize on that constraint (only
read data for relevant 'user_id's). Is that something I have to instruct
Spark to do?

2) Without #1, even with windowing, am I asking each partition to search
too much?

Please, if you have any links to documentation I can read on *how* Spark
works under the hood for these operations, I would appreciate it if you
give them. Spark has become a pillar on my team and knowing it in more
detail is warranted.

Slightly pivoting the subject here; I have tried something. It was a
suggestion by an AI chat bot and it seemed reasonable. In my main Spark
script I now have the line:

```
from pyspark.sql.functions import collect_list, struct, to_json

grouped_orders_df = orders_df.groupBy('user_id').agg(
    collect_list(to_json(struct('user_id', 'timestamp', 'total',
                                'description'))).alias('orders'))
```
(json is ultimately needed)

This actually achieves my goal by putting all of the 'orders' in a single
Array column. Now my worry is, will this column become too large if there
are a great many orders. Is there a limit? I have searched for documentation
on such a limit but could not find any.

I truly appreciate your help Mich and team,
Marco.
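
(For reference, a minimal sketch of the windowing idea Mich mentions below,
assuming an orders table with user_id, order_id, timestamp and total columns;
it ranks each user's orders and keeps a running total without joining back to
users:)

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rank, sum as sum_
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
orders_df = spark.createDataFrame(
    [(1, 50001, "2023-04-01", 101.11), (1, 50002, "2023-04-02", 102.11)],
    "user_id int, order_id int, timestamp string, total double")

w = Window.partitionBy("user_id").orderBy(col("timestamp"))
statement_df = (orders_df
                .withColumn("order_rank", rank().over(w))
                .withColumn("running_total", sum_("total").over(w)))
statement_df.show()
```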


On Tue, Apr 25, 2023 at 5:40 AM Mich Talebzadeh 
wrote:

> Have you thought of using  windowing function
> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
> achieve this?
>
> Effectively all your information is in the orders table.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> I have two tables: {users, orders}. In this example, let's say that for
>> each 1 User in the users table, there are 10 Orders in the orders table.
>>
>> I have to use pyspark to generate a statement of Orders for each User.
>> So, a single user will need his/her own list of Orders. Additionally, I
>> need to send this statement to the real-world user via email (for example).
>>
>> My first intuition was to apply a DataFrame.foreach() on the users
>> DataFrame. This way, I can rely on the spark workers to handle the email
>> sending individually. However, I now do not know the best way to get each
>> User's Orders.
>>
>> I will soon try the following (pseudo-code):
>>
>> ```
>> users_df = 
>> orders_df = 
>>
>> #this is poorly named for max understandability in this context
>> def foreach_function(row):
>>   user_id = row.user_id
>>   user_orders_df = orders_df.select(f'user_id = {user_id}')
>>
>>   #here, I'd get any User info from 'row'
>>   #then, I'd convert all 'user_orders' to JSON
>>   #then, I'd prepare the email and send it
>>
>> users_df.foreach(foreach_function)
>> ```
>>
>> It is my understanding that if I do my user-specific work in the foreach
>> function, I will capitalize on Spark's scalability when doing that work.
>> However, I am worried about two things:
>>
>> If I take all Orders up front...
>>
>> Will that work?
>> Will I be taking too much? Will I be taking Orders on partitions that
>> won't handle them (a different User)?
>>
>> If I create the orders_df (filtered) within the foreach function...
>>
>> Will it work?
>> Will that be too much IO to DB?
>>
>> The question ultimately is: How can I achieve this goal efficiently?
>>
>> I have not yet tried anything here. I am doing so as we speak, but am
>> suffering from choice-paralysis.
>>
>> Please and thank you.
>>
>


What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for
each 1 User in the users table, there are 10 Orders in the orders table.

I have to use pyspark to generate a statement of Orders for each User. So,
a single user will need his/her own list of Orders. Additionally, I need to
send this statement to the real-world user via email (for example).

My first intuition was to apply a DataFrame.foreach() on the users
DataFrame. This way, I can rely on the spark workers to handle the email
sending individually. However, I now do not know the best way to get each
User's Orders.

I will soon try the following (pseudo-code):

```
users_df = 
orders_df = 

#this is poorly named for max understandability in this context
def foreach_function(row):
  user_id = row.user_id
  user_orders_df = orders_df.select(f'user_id = {user_id}')

  #here, I'd get any User info from 'row'
  #then, I'd convert all 'user_orders' to JSON
  #then, I'd prepare the email and send it

users_df.foreach(foreach_function)
```

It is my understanding that if I do my user-specific work in the foreach
function, I will capitalize on Spark's scalability when doing that work.
However, I am worried about two things:

If I take all Orders up front...

Will that work?
Will I be taking too much? Will I be taking Orders on partitions that won't
handle them (a different User)?

If I create the orders_df (filtered) within the foreach function...

Will it work?
Will that be too much IO to DB?

The question ultimately is: How can I achieve this goal efficiently?

I have not yet tried anything here. I am doing so as we speak, but am
suffering from choice-paralysis.
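
(One pattern that came up later in this thread, sketched here under the
assumed users/orders shape described above: aggregate each user's orders into
a single JSON array column, so every resulting row is one complete statement
that a worker can then deliver:)

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct, to_json

spark = SparkSession.builder.getOrCreate()
users_df = spark.createDataFrame([(1, "u1@example.com")], "user_id int, email string")
orders_df = spark.createDataFrame(
    [(1, "2023-04-01", 101.11, "first order"),
     (1, "2023-04-02", 102.11, "second order")],
    "user_id int, timestamp string, total double, description string")

# one row per user, with all of that user's orders collected as JSON strings
statements_df = (orders_df
    .groupBy("user_id")
    .agg(collect_list(to_json(struct("timestamp", "total", "description")))
         .alias("orders"))
    .join(users_df, "user_id"))
```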

Please and thank you.


Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual
AMIs.


On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 And for the record, that AMI is ami-35b1885c. Again, you don't need to
 specify it explicitly; spark-ec2 will default to it.


 On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Marco,

 If you call spark-ec2 launch without specifying an AMI, it will default
 to the Spark-provided AMI.

 Nick


 On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi there,
 To answer your question; no there is no reason NOT to use an AMI that
 Spark has prepared. The reason we haven't is that we were not aware such
 AMIs existed. Would you kindly point us to the documentation where we can
 read about this further?

 Many many thanks, Shivaram.
 Marco.


 On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Is there any reason why you want to start with a vanilla amazon AMI
 rather than the ones we build and provide as a part of Spark EC2 scripts ?
 The AMIs we provide are close to the vanilla AMI but have the root account
 set up properly and install packages like Java that are used by Spark.

 If you wish to customize the AMI, you could always start with our AMI
 and add more packages you like -- I have definitely done this recently and
 it works with HVM and PVM as far as I can tell.

 Shivaram


 On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 I was able to keep the workaround ...around... by overwriting the
 generated '/root/.ssh/authorized_keys' file with a known good one, in the
 '/etc/rc.local' file


 On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Another thing I didn't mention. The AMI and user used: naturally I've
 created several of my own AMIs with the following characteristics. None 
 of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that, it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based from the new 
 AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten

 2) adding the 'ec2-user' to the 'root' group. This means that the
 ec2-user does not have to use sudo to perform any operations needing root
 privilidges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community
 to know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please 
 and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s
 10 -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge
 launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 
 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and
 that it must be 'root'. The typical workaround is as you said, allow 
 the
 ssh with the root user. Now, don't laugh, but, this worked last 
 Friday, but
 today (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the
 root user's 'authorized_keys' file is always overwritten. This means 
 the
 workaround doesn't work anymore! I would LOVE for someone to verify 
 this.

 Just to point out, I am trying to make this work with a
 paravirtual instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access
 and a lot of internal scripts assume have the user's home directory 
 hard
 coded as /root.   However all the Spark AMIs

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention. The AMI and user used: naturally I've
created several of my own AMIs with the following characteristics. None of
which worked.

1) Enabling ssh as root as per this guide (
http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
When doing this, I do not specify a user for the spark-ec2 script. What
happens is that, it works! But only while it's alive. If I stop the
instance, create an AMI, and launch a new instance based from the new AMI,
the change I made in the '/root/.ssh/authorized_keys' file is overwritten

2) adding the 'ec2-user' to the 'root' group. This means that the ec2-user
does not have to use sudo to perform any operations needing root
privilidges. When doing this, I specify the user 'ec2-user' for the
spark-ec2 script. An error occurs: rsync fails with exit code 23.

I believe HVMs still work. But it would be valuable to the community to
know that the root user work-around does/doesn't work any more for
paravirtual instances.

Thanks,
Marco.


On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell script
 which calls spark-ec2 wrapper script. I execute it from the 'ec2' directory
 of spark, as usual. The AMI used is the raw one from the AWS Quick Start
 section. It is the first option (an Amazon Linux paravirtual image). Any
 ideas or confirmation would be GREATLY appreciated. Please and thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 -v
 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge launch
 marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to launch
 the instances ? The typical workflow is to use the spark-ec2 wrapper script
 using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume the user's home directory is hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for
 ssh. Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user
 to log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.








Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the workaround ...around... by overwriting the
generated '/root/.ssh/authorized_keys' file with a known good one, in the
'/etc/rc.local' file


On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Another thing I didn't mention. The AMI and user used: naturally I've
 created several of my own AMIs with the following characteristics. None of
 which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that, it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based from the new AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is overwritten

 2) adding the 'ec2-user' to the 'root' group. This means that the ec2-user
 does not have to use sudo to perform any operations needing root
 privilidges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community to
 know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 -v
 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge launch
 marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume the user's home directory is hard coded 
 as
 /root.   However all the Spark AMIs we build should have root ssh access 
 --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for
 ssh. Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user
 to log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.









AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi all,
On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
Also, it is the default user for the Spark-EC2 script.

Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
instead of 'root'.

I can see that the Spark-EC2 script allows you to specify which user to log
in with, but even when I change this, the script fails for various reasons.
And the output SEEMS to indicate that the script is still based on the
specified user's home directory being '/root'.

Am I using this script wrong?
Has anyone had success with this 'ec2-user' user?
Any ideas?

Please and thank you,
Marco.


Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram,

OK so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is as you said, allow the ssh with the
root user. Now, don't laugh, but, this worked last Friday, but today
(Monday) it no longer works. :D Why? ...

...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
'authorized_keys' file is always overwritten. This means the workaround
doesn't work anymore! I would LOVE for someone to verify this.

Just to point out, I am trying to make this work with a paravirtual
instance and not an HVM instance.

Please and thanks,
Marco.


On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a lot
 of internal scripts assume the user's home directory is hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.