Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will look up the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql("drop table my_table") will actually drop my_table. It’s not a 
validation-only operation.
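
If you need a syntax-only check that never touches the catalog and never
executes anything, one option is to call the parser directly. The sketch below
is only an illustration: it reaches Spark's internal Catalyst parser through
py4j (spark._jsparkSession.sessionState().sqlParser()), which is not a public,
stable API and may change between releases.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parse-only").getOrCreate()

def parses_ok(sql: str) -> bool:
    """Check syntax only: no table lookup, no execution.

    Goes through Spark's internal parser via py4j (unsupported API).
    """
    try:
        spark._jsparkSession.sessionState().sqlParser().parsePlan(sql)
        return True
    except Exception as e:  # Py4JJavaError wrapping a ParseException
        print(f"Syntax error: {e}")
        return False

print(parses_ok("SELECT id FROM some_table WHERE id > 1"))  # True: syntax only
print(parses_ok("SELECT FROM WHERE"))                       # False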

This question of validating SQL is already discussed on Stack Overflow. You 
may find some useful tips there.

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>       /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
> 
> Here we only check for syntax errors, not for query semantics. We are not 
> validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary:
> This method validates syntax but will not catch semantic errors.
> If you need more comprehensive validation, consider using a testing framework 
> and a small dataset.
> For complex queries, a linter or code analysis tool can help identify 
> potential issues.
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
> view my LinkedIn profile
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam wrote:
>> Hello,
>> Is there a way to validate PySpark SQL for syntax errors only? I 
>> cannot connect to the actual data set to perform this validation. Any help 
>> would be appreciated.
>> 
>> 
>> Thanks
>> Ram



Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation 
<https://github.com/PyMySQL/PyMySQL/blob/f13f054abcc18b39855a760a84be0a517f0da658/pymysql/protocol.py>
 of the MySQL client-server protocol. It does not use JDBC.
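
One way to follow the earlier suggestion in this thread (testing JDBC + SSH
tunnel without Spark) is sketched below. It assumes the jaydebeapi and
sshtunnel packages and a local copy of the MySQL Connector/J jar; every host,
credential, and path here is a placeholder.

import jaydebeapi
from sshtunnel import SSHTunnelForwarder

# All values below are placeholders: substitute your own.
ssh_host, ssh_port = "bastion.example.com", 22
sql_host, sql_port = "mysql.internal.example.com", 3306

with SSHTunnelForwarder(
        (ssh_host, ssh_port),
        ssh_username="ssh_user",
        ssh_pkey="/path/to/key.pem",
        remote_bind_address=(sql_host, sql_port)) as tunnel:
    # Plain JDBC (no Spark) through the tunnel's local port.
    url = f"jdbc:mysql://127.0.0.1:{tunnel.local_bind_port}/b2b"
    conn = jaydebeapi.connect(
        "com.mysql.cj.jdbc.Driver",
        url,
        ["db_user", "db_password"],
        "/path/to/mysql-connector-j.jar",
    )
    curs = conn.cursor()
    curs.execute("SELECT 1")
    print(curs.fetchall())  # expect [(1,)] if the JDBC + tunnel path works
    conn.close()

If this plain JDBC check succeeds but the Spark read still fails, that at
least narrows the problem down to how Spark uses the connection rather than
to the tunnel or the driver jar.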


> On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan 
>  wrote:
> 
> Thanks for the advice Nicholas. 
> 
> As mentioned in the original email, I have tried JDBC + SSH Tunnel using 
> pymysql and sshtunnel and it worked fine. The problem happens only with Spark.
> 
> Thanks,
> Venkat
> 
> 
> 
> On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> This is not a question for the dev list. Moving dev to bcc.
>> 
>> One thing I would try is to connect to this database using JDBC + SSH 
>> tunnel, but without Spark. That way you can focus on getting the JDBC 
>> connection to work without Spark complicating the picture for you.
>> 
>> 
>>> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>>> <venkatesa...@noonacademy.com> wrote:
>>> 
>>> Hi Team,
>>> 
>>> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is 
>>> same as the one in this Stackoverflow question 
>>> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
>>>  but there are no answers there.
>>> 
>>> This is what I am trying:
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>>         (ssh_host, ssh_port),
>>>         ssh_username=ssh_user,
>>>         ssh_pkey=ssh_key_file,
>>>         remote_bind_address=(sql_hostname, sql_port),
>>>         local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>>>     tunnel.local_bind_port
>>>     b1_semester_df = spark.read \
>>>         .format("jdbc") \
>>>         .option("url", b2b_mysql_url.replace("<>", str(tunnel.local_bind_port))) \
>>>         .option("query", b1_semester_sql) \
>>>         .option("database", 'b2b') \
>>>         .option("password", b2b_mysql_password) \
>>>         .option("driver", "com.mysql.cj.jdbc.Driver") \
>>>         .load()
>>>     b1_semester_df.count()
>>> 
>>> Here, the b1_semester_df is loaded but when I try count on the same Df it 
>>> fails saying this
>>> 
>>> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
>>> aborting job
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
>>> print(self._jdf.showString(n, 20, vertical))
>>>   File 
>>> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
>>> 1257, in __call__
>>>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
>>> return f(*a, **kw)
>>>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
>>> line 328, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling 
>>> o284.showString.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
>>> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
>>> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
>>> failure
>>> 
>>> However, the same is working fine with pandas df. I have tried this below 
>>> and it worked.
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>>         (ssh_host, ssh_port),
>>>         ssh_username=ssh_user,
>>>         ssh_pkey=ssh_key_file,
>>>         remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>>>     conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>>>                            passwd=sql_password, db=sql_main_database,
>>>                            port=tunnel.local_bind_port)
>>>     df = pd.read_sql_query(b1_semester_sql, conn)
>>>     spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>>> 
>>> So wanted to check what I am missing with my Spark usage. Please help.
>>> 
>>> Thanks,
>>> Venkat
>>> 
>> 



Suppressing output from Apache Ivy (?) when calling spark-submit with --packages

2018-02-27 Thread Nicholas Chammas
I’m not sure whether this is something controllable via Spark, but when you
call spark-submit with --packages you get a lot of output. Is there any way
to suppress it? Does it come from Apache Ivy?

I posted more details about what I’m seeing on Stack Overflow.

Nick


Re: Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nicholas Chammas
Here’s a repro for a very similar issue where Spark hangs on the UDF, which
I think is related to the SPARK_HOME issue. I posted the repro on the EMR
forum,
but in case you can’t access it:

   1. I’m running EMR 5.6.0, Spark 2.1.1, and Python 3.5.1.
   2. Create a simple Python package by creating a directory called udftest.
   3. Inside udftest put an empty __init__.py and a nothing.py.
   4. nothing.py should have the following contents:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def do_nothing(s: int) -> int:
return s

do_nothing_udf = udf(do_nothing, IntegerType())

   5. From your home directory (the one that contains your udftest package),
   create a ZIP that we will ship to YARN.

pushd udftest/
zip -rq ../udftest.zip *
popd

   6. Start a PySpark shell with our test package.

export PYSPARK_PYTHON=python3
pyspark \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYSPARK_PYTHON" \
  --archives "udftest.zip#udftest"

   7. Now try to use the UDF. It will hang.

from udftest.nothing import do_nothing_udf
spark.range(10).select(do_nothing_udf('id')).show()  # hangs

   8. The strange thing is, if you define the exact same UDF directly in the
   active PySpark shell, it works fine! It’s only when you import it from a
   user-defined module that you see this issue.


On Thu, Jun 22, 2017 at 12:08 PM Nick Chammas 
wrote:

> I’m seeing a strange issue on EMR which I posted about here
> 
> .
>
> In brief, when I try to import a UDF I’ve defined, Python somehow fails to
> find Spark. This exact code works for me locally and works on our
> on-premises CDH cluster under YARN.
>
> This is the traceback:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
> print(self._jdf.showString(n, 20))
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o89.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, ip-10-97-35-12.ec2.internal, executor 1): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 161, in main
> func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 91, in read_udfs
> _, udf = read_single_udf(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 78, in read_single_udf
> f, return_type = read_command(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 54, in read_command
> command = serializer._read_with_length(file)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
>  line 169, in _read_with_length
> return self.loads(obj)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
>  line 451, in loads
> return pickle.loads(obj, encoding=encoding)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/person.py",
>  line 7, in 
> from splinkr.util import repartition_to_size
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/util.py",
>  line 34, in 
> containsNull=False,
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
>  line 1872, in udf
> return UserDefinedFunction(f, returnType)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
>  line 1830, in __init__
> self._judf = self._create_judf(name

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Ah, that's why all the stuff about scheduler pools is under the
section "Scheduling
Within an Application
<https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application>".
😅 I am so used to talking to my coworkers about jobs in the sense of
applications that I forgot that your typical Spark application submits multiple
"jobs", each of which has multiple stages, etc.

So in my case I need to read up more closely about YARN queues
<https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
since I want to share resources *across* applications. Thanks Mark!
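
For anyone else landing here, the within-application knob Mark describes looks
roughly like this. It is a minimal sketch, assuming spark.scheduler.mode=FAIR
and a pool named "production" defined in fairscheduler.xml (an unknown pool
name would just get default settings):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pools-demo")
         .config("spark.scheduler.mode", "FAIR")  # enable fair scheduling of jobs
         .getOrCreate())
sc = spark.sparkContext

# Jobs started from this thread after this call run in the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
spark.range(10 ** 7).count()  # this job is scheduled in the "production" pool

# Clear the property so later jobs from this thread use the default pool again.
sc.setLocalProperty("spark.scheduler.pool", None)

None of this affects how YARN divides resources between applications; for
that, submitting to a YARN queue (spark-submit --queue, i.e. spark.yarn.queue)
is the relevant knob.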

On Wed, Apr 5, 2017 at 4:31 PM Mark Hamstra  wrote:

> `spark-submit` creates a new Application that will need to get resources
> from YARN. Spark's scheduler pools will determine how those resources are
> allocated among whatever Jobs run within the new Application.
>
> Spark's scheduler pools are only relevant when you are submitting multiple
> Jobs within a single Application (i.e., you are using the same SparkContext
> to launch multiple Jobs) and you have used SparkContext#setLocalProperty to
> set "spark.scheduler.pool" to something other than the default pool before
> a particular Job intended to use that pool is started via that SparkContext.
>
> On Wed, Apr 5, 2017 at 1:11 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Hmm, so when I submit an application with `spark-submit`, I need to
> guarantee it resources using YARN queues and not Spark's scheduler pools.
> Is that correct?
>
> When are Spark's scheduler pools relevant/useful in this context?
>
> On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra 
> wrote:
>
> grrr... s/your/you're/
>
> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra 
> wrote:
>
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas 
> wrote:
>
> I'm having trouble understanding the difference between Spark fair
> scheduler pools
> <https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools>
> and YARN queues
> <https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>.
> Do they conflict? Does one override the other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-fair-scheduler-pools-vs-YARN-queues-tp28572.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
>
>
>
>


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Hmm, so when I submit an application with `spark-submit`, I need to
guarantee it resources using YARN queues and not Spark's scheduler pools.
Is that correct?

When are Spark's scheduler pools relevant/useful in this context?

On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra  wrote:

> grrr... s/your/you're/
>
> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra 
> wrote:
>
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas 
> wrote:
>
> I'm having trouble understanding the difference between Spark fair
> scheduler pools
> 
> and YARN queues
> .
> Do they conflict? Does one override the other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
>
>
>


Re: New Amazon AMIs for EC2 script

2017-02-23 Thread Nicholas Chammas
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A
related issue from the current issue tracker that you may want to
follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74

As I said there, I think requiring custom AMIs is one of the major
maintenance headaches of spark-ec2. I solved this problem in my own
project, Flintrock, by working with
the default Amazon Linux AMIs and letting people more freely bring their
own AMI.

Nick


On Thu, Feb 23, 2017 at 7:23 AM in4maniac  wrote:

> Hyy all,
>
> I have been using the EC2 script to launch R&D pyspark clusters for a while
> now. As we use a lot of packages such as numpy and scipy with openblas,
> scikit-learn, bokeh, vowpal wabbit, pystan, etc., all this time we have
> been building AMIs on top of the standard spark-AMIs at
> https://github.com/amplab/spark-ec2/tree/branch-1.6/ami-list/us-east-1
>
> Mainly, I have done the following:
> - updated yum
> - Changed the standard python to python 2.7
> - changed pip to 2.7 and installed a lot of libraries on top of the existing
> AMIs and created my own AMIs to avoid having to bootstrap.
>
> But the standard ec2 AMIs are from *early February 2014* and have now
> become extremely fragile. For example, when I update a certain library,
> ipython would break, or pip would break and so forth.
>
> Can someone please direct me to a more up-to-date AMI that I can use with
> more confidence. And I am also interested to know what things need to be in
> the AMI, if I wanted to build an AMI from scratch (Last resort :( )
>
> And isn't it time to have a ticket in the spark project to build a new
> suite
> of AMIs for the EC2 script?
> https://issues.apache.org/jira/browse/SPARK-922
>
> Many thanks
> in4maniac
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-Amazon-AMIs-for-EC2-script-tp28419.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread Nicholas Chammas
RDDs and DataFrames do not guarantee any specific ordering of data. They
are like tables in a SQL database. The only way to get a guaranteed
ordering of rows is to explicitly specify an orderBy() clause in your
statement. Any ordering you see otherwise is incidental.
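
A minimal sketch of the explicit approach, reusing the 'n' column from the
example quoted below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ordering-demo").getOrCreate()

df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.count()

# Only an explicit orderBy() guarantees the order of the collected rows.
rows = df.coalesce(2).orderBy('n').collect()
print(rows)  # [Row(n=0), Row(n=1), Row(n=2)]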

On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <
david.hagl...@husqvarnagroup.com> wrote:

> Hi,
>
>
>
> I found something that surprised me, I expected the order of the rows to
> be preserved, so I suspect this might be a bug. The problem is illustrated
> with the Python example below:
>
>
>
> In [1]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.cache()
>
> df.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[1]:
>
> [[Row(n=1)], [Row(n=0), Row(n=2)]]
>
>
>
> Note how n=1 comes before n=0, above.
>
>
>
>
>
> If I remove the cache line I get the rows in the correct order and the
> same if I use df.rdd.count() instead of df.count(), see examples below:
>
>
>
> In [2]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[2]:
>
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
>
>
> In [3]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.cache()
>
> df.rdd.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[3]:
>
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
>
>
>
>
> I use spark 2.1.0 and pyspark.
>
>
>
> Regards,
>
> /David
>
> The information in this email may be confidential and/or legally
> privileged. It has been sent for the sole use of the intended recipient(s).
> If you are not an intended recipient, you are strictly prohibited from
> reading, disclosing, distributing, copying or using this email or any of
> its contents, in any way whatsoever. If you have received this email in
> error, please contact the sender by reply email and destroy all copies of
> the original message. Please also be advised that emails are not a secure
> form for communication, and may contain errors.
>


Re: Debugging a PythonException with no details

2017-01-17 Thread Nicholas Chammas
Hey Marco,

I stopped seeing this error once I started round-tripping intermediate
DataFrames to disk.
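
By "round-tripping to disk" I mean something like this (a minimal sketch, with
a stand-in DataFrame, a placeholder column name, and a placeholder path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("roundtrip-demo").getOrCreate()

# Stand-in for the real intermediate DataFrame in the pipeline.
intermediate_df = spark.range(1000).withColumnRenamed("id", "some_column")

# Write the intermediate result out and read it back, which cuts the lineage
# before the later (JDBC-sourced, UDF-heavy) steps run.
path = "/tmp/intermediate_df.parquet"  # placeholder path
intermediate_df.write.mode("overwrite").parquet(path)
intermediate_df = spark.read.parquet(path)

result = intermediate_df.groupBy("some_column").count()
result.show()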

You can read more about what I saw here:
https://github.com/graphframes/graphframes/issues/159

Nick

On Sat, Jan 14, 2017 at 4:02 PM Marco Mistroni  wrote:

> It seems it has to do with UDF..Could u share snippet of code you are
> running?
> Kr
>
> On 14 Jan 2017 1:40 am, "Nicholas Chammas" 
> wrote:
>
> I’m looking for tips on how to debug a PythonException that’s very sparse
> on details. The full exception is below, but the only interesting bits
> appear to be the following lines:
>
> org.apache.spark.api.python.PythonException:
> ...
> py4j.protocol.Py4JError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext
>
> Otherwise, the only other clue from the traceback I can see is that the
> problem may involve a UDF somehow.
>
> I’ve tested this code against many datasets (stored as ORC) and it works
> fine. The same code only seems to throw this error on a few datasets that
> happen to be sourced via JDBC. I can’t seem to get a lead on what might be
> going wrong here.
>
> Does anyone have tips on how to debug a problem like this? How do I find
> more specifically what is going wrong?
>
> Nick
>
> Here’s the full exception:
>
> 17/01/13 17:12:14 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 15, 
> devlx023.private.massmutual.com, executor 4): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 
> 161, in main
> func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 97, 
> in read_udfs
> arg_offsets, udf = read_single_udf(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 78, 
> in read_single_udf
> f, return_type = read_command(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 54, 
> in read_command
> command = serializer._read_with_length(file)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 169, in _read_with_length
> return self.loads(obj)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 431, in loads
> return pickle.loads(obj, encoding=encoding)
>   File 
> "/hadoop/yarn/nm/usercache/jenkins/appcache/application_1483203887152_1207/container_1483203887152_1207_01_05/splinkr/person.py",
>  line 111, in 
> py_normalize_udf = udf(py_normalize, StringType())
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", 
> line 1868, in udf
> return UserDefinedFunction(f, returnType)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", 
> line 1826, in __init__
> self._judf = self._create_judf(name)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", 
> line 1830, in _create_judf
> sc = SparkContext.getOrCreate()
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 307, in getOrCreate
> SparkContext(conf=conf or SparkConf())
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 118, in __init__
> conf, jsc, profiler_cls)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 179, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 246, in _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1401, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 327, in get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext
>
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at 
> org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
> at 
> org.apache.s

Debugging a PythonException with no details

2017-01-13 Thread Nicholas Chammas
I’m looking for tips on how to debug a PythonException that’s very sparse
on details. The full exception is below, but the only interesting bits
appear to be the following lines:

org.apache.spark.api.python.PythonException:
...
py4j.protocol.Py4JError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext

Otherwise, the only other clue from the traceback I can see is that the
problem may involve a UDF somehow.

I’ve tested this code against many datasets (stored as ORC) and it works
fine. The same code only seems to throw this error on a few datasets that
happen to be sourced via JDBC. I can’t seem to get a lead on what might be
going wrong here.

Does anyone have tips on how to debug a problem like this? How do I find
more specifically what is going wrong?

Nick

Here’s the full exception:

17/01/13 17:12:14 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID
15, devlx023.private.massmutual.com, executor 4):
org.apache.spark.api.python.PythonException: Traceback (most recent
call last):
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 161, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 97, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 78, in read_single_udf
f, return_type = read_command(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 54, in read_command
command = serializer._read_with_length(file)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py",
line 169, in _read_with_length
return self.loads(obj)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py",
line 431, in loads
return pickle.loads(obj, encoding=encoding)
  File 
"/hadoop/yarn/nm/usercache/jenkins/appcache/application_1483203887152_1207/container_1483203887152_1207_01_05/splinkr/person.py",
line 111, in 
py_normalize_udf = udf(py_normalize, StringType())
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py",
line 1868, in udf
return UserDefinedFunction(f, returnType)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py",
line 1826, in __init__
self._judf = self._create_judf(name)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py",
line 1830, in _create_judf
sc = SparkContext.getOrCreate()
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 307, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 118, in __init__
conf, jsc, profiler_cls)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 179, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 246, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
  File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1401, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
line 327, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
at 
org.apache.

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
I wish I could provide additional suggestions. Maybe one of the admins can
step in and help. I'm just another random user trying (with mixed success)
to be helpful. 😅

Sorry again to everyone about my spam, which just added to the problem.

On Thu, Dec 8, 2016 at 11:22 AM Chen, Yan I  wrote:

> I’m pretty sure I didn’t.
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:56 AM
> *To:* Chen, Yan I; Di Zhu
>
>
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Oh, hmm...
>
> Did you perhaps subscribe with a different address than the one you're
> trying to unsubscribe from?
>
> For example, you subscribed with myemail+sp...@gmail.com but you send the
> unsubscribe email from myem...@gmail.com
>
> On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I wrote:
>
> The reason I sent that email is because I did send emails to
> user-unsubscr...@spark.apache.org and dev-unsubscr...@spark.apache.org
> two months ago. But I can still receive a lot of emails every day. I even
> did that again before 10AM EST and got confirmation that I’m unsubscribed,
> but I still received this email.
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:02 AM
> *To:* Di Zhu
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Yes, sorry about that. I didn't think before responding to all those who
> asked to unsubscribe.
>
>
>
> On Thu, Dec 8, 2016 at 10:00 AM Di Zhu 
> wrote:
>
> Could you send to individual privately without cc to all users every time?
>
>
>
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:
>
>
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
>
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
> *This e-mail message, including any attachments, is for the sole use of
> the person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil. *
>
> * Esta mensagem pode conter informação confidencial ou privilegiada, sendo
> seu sigilo protegido por lei. Se você não for o destinatário ou a pessoa
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as
> informações nela contidas ou tomar qualquer ação baseada nessas
> informações. Se você recebeu esta mensagem por engano, por favor, avise
> imediatamente ao remetente, respondendo o e-mail e em seguida apague-a.
> Agradecemos sua cooperação. *
>
>
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur
> immédiatement, par retour de courriel ou par un autre moyen. Vous avez
> accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à
> l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de
> cette confirmation pour les fins de reference future.
>
> ___
>
> If you received this email in er

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Oh, hmm...

Did you perhaps subscribe with a different address than the one you're
trying to unsubscribe from?

For example, you subscribed with myemail+sp...@gmail.com but you send the
unsubscribe email from myem...@gmail.com
On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I wrote:

> The reason I sent that email is because I did send emails to
> user-unsubscr...@spark.apache.org and dev-unsubscr...@spark.apache.org
> two months ago. But I can still receive a lot of emails every day. I even
> did that again before 10AM EST and got confirmation that I’m unsubscribed,
> but I still received this email.
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:02 AM
> *To:* Di Zhu
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Yes, sorry about that. I didn't think before responding to all those who
> asked to unsubscribe.
>
>
>
> On Thu, Dec 8, 2016 at 10:00 AM Di Zhu 
> wrote:
>
> Could you send to individual privately without cc to all users every time?
>
>
>
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:
>
>
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
>
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
>
>
> *This e-mail message, including any attachments, is for the sole use of
> the person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil. Esta mensagem pode conter informação confidencial
> ou privilegiada, sendo seu sigilo protegido por lei. Se você não for o
> destinatário ou a pessoa autorizada a receber esta mensagem, não pode usar,
> copiar ou divulgar as informações nela contidas ou tomar qualquer ação
> baseada nessas informações. Se você recebeu esta mensagem por engano, por
> favor, avise imediatamente ao remetente, respondendo o e-mail e em seguida
> apague-a. Agradecemos sua cooperação. *
>
>
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur
> immédiatement, par retour de courriel ou par un autre moyen. Vous avez
> accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à
> l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de
> cette confirmation pour les fins de reference future.
>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Yes, sorry about that. I didn't think before responding to all those who
asked to unsubscribe.

On Thu, Dec 8, 2016 at 10:00 AM Di Zhu  wrote:

> Could you send to individual privately without cc to all users every time?
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas 
> wrote:
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
> This e-mail message, including any attachments, is for the sole use of the
> person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil.
>
> Esta mensagem pode conter informação confidencial ou privilegiada, sendo
> seu sigilo protegido por lei. Se você não for o destinatário ou a pessoa
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as
> informações nela contidas ou tomar qualquer ação baseada nessas
> informações. Se você recebeu esta mensagem por engano, por favor, avise
> imediatamente ao remetente, respondendo o e-mail e em seguida apague-a.
> Agradecemos sua cooperação.
>
>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva 
wrote:

>
> This e-mail message, including any attachments, is for the sole use of the
> person to whom it has been sent and may contain information that is
> confidential or legally protected. If you are not the intended recipient or
> have received this message in error, you are not authorized to copy,
> distribute, or otherwise use it or its attachments. Please notify the
> sender immediately by return email and permanently delete this message and
> any attachments. NeoGrid makes no warranty that this email is error or
> virus free. NeoGrid Europe Limited is a company registered in the United
> Kingdom with the registration number 7717968. The registered office is 8-10
> Upper Marlborough Road, St Albans AL1 3UR, Hertfordshire, UK. NeoGrid
> Netherlands B.V. is a company registered in the Netherlands with the
> registration number 3416.6499 and registered office at Science Park 400,
> 1098 XH Amsterdam, NL. NeoGrid North America Limited is a company
> registered in the United States with the registration number 52-2242825.
> The registered office is 55 West Monroe Street, Suite 3590-60603, Chicago,
> IL, USA. NeoGrid Japan is located at New Otani Garden Court 7F, 4-1
> Kioi-cho, Chiyoda-ku, Tokyo 102-0094, Japan. NeoGrid Software SA is a
> company registered in Brazil, with the registration number CNPJ:
> 03.553.145/0001-08 and located at Av. Santos Dumont, 935, 89.218-105,
> Joinville - SC – Brazil.
>
> Esta mensagem pode conter informação confidencial ou privilegiada, sendo
> seu sigilo protegido por lei. Se você não for o destinatário ou a pessoa
> autorizada a receber esta mensagem, não pode usar, copiar ou divulgar as
> informações nela contidas ou tomar qualquer ação baseada nessas
> informações. Se você recebeu esta mensagem por engano, por favor, avise
> imediatamente ao remetente, respondendo o e-mail e em seguida apague-a.
> Agradecemos sua cooperação.
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:46 AM Tao Lu  wrote:

>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 8:01 AM Niki Pavlopoulou  wrote:

> unsubscribe
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:50 AM Juan Caravaca 
wrote:

> unsubscribe
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:54 AM Kishorkumar Patil
 wrote:

>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:42 AM Chen, Yan I  wrote:

>
>
>
> ___
>
> If you received this email in error, please advise the sender (by return
> email or otherwise) immediately. You have consented to receive the attached
> electronically at the above-noted email address; please retain a copy of
> this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur
> immédiatement, par retour de courriel ou par un autre moyen. Vous avez
> accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à
> l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de
> cette confirmation pour les fins de reference future.
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:17 AM Prashant Singh Thakur <
prashant.tha...@impetus.co.in> wrote:

>
>
>
>
> Best Regards,
>
> Prashant Thakur
>
> Work : 6046
>
> Mobile: +91-9740266522 <+91%2097402%2066522>
>
>
>
> --
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:08 AM Kranthi Gmail 
wrote:

>
>
> --
> Kranthi
>
> PS: Sent from mobile, pls excuse the brevity and typos.
>
> On Dec 7, 2016, at 8:05 PM, Siddhartha Khaitan <
> siddhartha.khai...@gmail.com> wrote:
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 6:27 AM Vinicius Barreto <
vinicius.s.barr...@gmail.com> wrote:

> Unsubscribe
>
> On Dec 7, 2016 at 17:46, "map reduced" wrote:
>
> Hi,
>
> I am trying to solve this problem - in my streaming flow, every day a few
> jobs fail for a few batches due to some (say Kafka cluster maintenance, etc.,
> mostly unavoidable) reasons, and then the stream resumes successfully.
> I want to reprocess those failed jobs programmatically (assume I have a
> way of getting start-end offsets for kafka topics for failed jobs). I was
> thinking of these options:
> 1) Somehow pause streaming job when it detects failing jobs - this seems
> not possible.
> 2) From driver - run additional processing to check every few minutes
> using driver rest api (/api/v1/applications...) what jobs have failed and
> submit batch jobs for those failed jobs
>
> 1 - doesn't seem to be possible, and I don't want to kill streaming
> context just for few failing batches to stop the job for some time and
> resume after few minutes.
> 2 - seems like a viable option, but a little complicated, since even the
> batch job can fail due to whatever reasons and I am back to tracking that
> separately etc.
>
> Has anyone faced this issue, or does anyone have any suggestions?
>
> Thanks,
> KP
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:54 AM Roger Holenweger  wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 1:34 AM smith_666  wrote:

>
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:12 AM Ajit Jaokar 
wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Wed, Dec 7, 2016 at 10:53 PM Ajith Jose  wrote:

>
>


Re: Strongly Connected Components

2016-11-13 Thread Nicholas Chammas
FYI: There is a new connected components implementation coming in
GraphFrames 0.3.

See: https://github.com/graphframes/graphframes/pull/119

Implementation is based on:
https://mmds-data.org/presentations/2014/vassilvitskii_mmds14.pdf
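
For anyone who wants to try either algorithm from PySpark, the calls look
roughly like this. This is a sketch assuming the graphframes package is on the
classpath, and note that the newer connected components implementation expects
a checkpoint directory to be set:

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cc-demo").getOrCreate()
# Required by the newer connected components implementation.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

vertices = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
edges = spark.createDataFrame([(1, 2), (3, 4)], ["src", "dst"])
g = GraphFrame(vertices, edges)

cc = g.connectedComponents()                     # new implementation in GraphFrames 0.3+
scc = g.stronglyConnectedComponents(maxIter=10)  # GraphX-backed; needs an iteration count

cc.show()
scc.show()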

Nick

On Sat, Nov 12, 2016 at 3:01 PM Koert Kuipers  wrote:

> oh ok i see now its not the same
>
> On Sat, Nov 12, 2016 at 2:48 PM, Koert Kuipers  wrote:
>
> not sure i see the faster algo in the paper you mention.
>
> i see this in section 6.1.2:
> "In what follows we give a simple labeling algorithm that computes
> connectivity  on  sparse  graphs  in O(log N) rounds."
> N here is the size of the graph, not the largest component diameter.
>
> that is the exact same algo as is implemented in graphx i think. or is it
> not?
>
> On Fri, Nov 11, 2016 at 7:58 PM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
> Hi Shreya,
> GraphFrames just calls the GraphX strongly connected components code. (
> https://github.com/graphframes/graphframes/blob/release-0.2.0/src/main/scala/org/graphframes/lib/StronglyConnectedComponents.scala#L51
> )
>
> For choosing the number of iterations: If the number of iterations is less
> than the diameter of the graph, you may get an incorrect result. But
> running for more iterations than that buys you nothing. The algorithm is
> basically to broadcast your ID to all your neighbors in the first round,
> and then broadcast the smallest ID that you have seen so far in the next
> rounds. So with only 1 round you will get a wrong result unless each vertex
> is connected to the vertex with the lowest ID in that component. (Unlikely
> in a real graph.)
>
> See
> https://github.com/apache/spark/blob/v2.0.2/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala
> for the actual implementation.
>
> A better algorithm exists for this problem that only requires O(log(N))
> iterations when N is the largest component diameter. (It is described in "A
> Model of Computation for MapReduce",
> http://www.sidsuri.com/Publications_files/mrc.pdf.) This outperforms
> GraphX's implementation immensely. (See the last slide of
> http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos#33.)
> The large advantage is due to the lower number of necessary iterations.
>
> For why this is failing even with one iteration: I would first check your
> partitioning. Too many or too few partitions could equally cause the issue.
> If you are lucky, there is no overlap between the "too many" and "too few"
> domains :).
>
> On Fri, Nov 11, 2016 at 7:39 PM, Shreya Agarwal 
> wrote:
>
> Tried GraphFrames. Still faced the same – job died after a few hours . The
> errors I see (And I see tons of them) are –
>
> (I ran with 3 times the partitions as well, which was 12 times number of
> executors , but still the same.)
>
>
>
> -
>
> ERROR NativeAzureFileSystem: Encountered Storage Exception for write on
> Blob : hdp/spark2-events/application_1478717432179_0021.inprogress
> Exception details: null Error Code : RequestBodyTooLarge
>
>
>
> -
>
>
>
> 16/11/11 09:21:46 ERROR TransportResponseHandler: Still have 3 requests
> outstanding when connection from /10.0.0.95:43301 is closed
>
> 16/11/11 09:21:46 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 2
> outstanding blocks after 5000 ms
>
> 16/11/11 09:21:46 INFO ShuffleBlockFetcherIterator: Getting 1500 non-empty
> blocks out of 1500 blocks
>
> 16/11/11 09:21:46 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
>
> java.io.IOException: Connection from /10.0.0.95:43301 closed
>
>
>
> -
>
>
>
> 16/11/11 09:21:46 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
>
> java.lang.RuntimeException: java.io.FileNotFoundException:
> /mnt/resource/hadoop/yarn/local/usercache/shreyagrssh/appcache/application_1478717432179_0021/blockmgr-b1dde30d-359e-4932-b7a4-a5e138a52360/37/shuffle_1346_21_0.index
> (No such file or directory)
>
>
>
> -
>
>
>
> org.apache.spark.SparkException: Exception thrown in awaitResult
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>
> at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>
> at org.apache.spark.executor.Executor.org
> $apache$spark$exe

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
I apologize for my harsh tone. You are right, it was unnecessary and
discourteous.

On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh 
wrote:

> Hi,
>
> You made such statement:
>
> "That's complete nonsense."
>
> That is a strong language and void of any courtesy. Only dogmatic
> individuals make such statements, engaging the keyboard before thinking
> about it.
>
> You are perfectly in your right to agree to differ. However, that does not
> give you the right to call other peoples opinion nonsense.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:54, Nicholas Chammas  > wrote:
>
>> You made a specific claim -- that Spark will move away from Python --
>> which I responded to with clear references and data. How on earth is that a
>> "religious argument"?
>>
>> I'm not saying that Python is better than Scala or anything like that.
>> I'm just addressing your specific claim about its future in the Spark
>> project.
>>
>> On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Right so. We are back into religious arguments. Best of luck
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 September 2016 at 15:35, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I believe as we progress in time Spark is going to move away from
>>>>> Python. If you look at 2014 Databricks code examples, they were
>>>>> mostly in Python. Now they are mostly in Scala for a reason.
>>>>>
>>>>
>>>> That's complete nonsense.
>>>>
>>>> First off, you can find dozens and dozens of Python code examples here:
>>>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>>>
>>>> The Python API was added to Spark in 0.7.0
>>>> <http://spark.apache.org/news/spark-0-7-0-released.html>, back in
>>>> February of 2013, before Spark was even accepted into the Apache incubator.
>>>> Since then it's undergone major and continuous development. Though it does
>>>> lag behind the Scala API in some areas, it's a first-class language and
>>>> bringing it up to parity with Scala is an explicit project goal. A quick
>>>> example off the top of my head is all the work that's going into model
>>>> import/export for Python: SPARK-11939
>>>> <https://issues.apache.org/jira/browse/SPARK-11939>
>>>>
>>>> Additionally, according to the 2015 Spark Survey
>>>> <http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf?t=1472746902480>,
>>>> 58% of Spark users use the Python API, more than any other language save
>>>> for Scala (71%). (Users can select multiple languages on the survey.)
>>>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>>>> after Windows and Spark Streaming users.
>>>>
>>>> Any notion that Spark is going to "move away from Python" is completely
>>>> contradicted by the facts.
>>>>
>>>> Nick
>>>>
>>>>
>>>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
You made a specific claim -- that Spark will move away from Python -- which
I responded to with clear references and data. How on earth is that a
"religious argument"?

I'm not saying that Python is better than Scala or anything like that. I'm
just addressing your specific claim about its future in the Spark project.

On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh 
wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas  > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> <http://spark.apache.org/news/spark-0-7-0-released.html>, back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> <https://issues.apache.org/jira/browse/SPARK-11939>
>>
>> Additionally, according to the 2015 Spark Survey
>> <http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf?t=1472746902480>,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
wrote:

> I believe as we progress in time Spark is going to move away from Python. If
> you look at 2014 Databricks code examples, they were mostly in Python. Now
> they are mostly in Scala for a reason.
>

That's complete nonsense.

First off, you can find dozens and dozens of Python code examples here:
https://github.com/apache/spark/tree/master/examples/src/main/python

The Python API was added to Spark in 0.7.0
, back in February
of 2013, before Spark was even accepted into the Apache incubator. Since
then it's undergone major and continuous development. Though it does lag
behind the Scala API in some areas, it's a first-class language and
bringing it up to parity with Scala is an explicit project goal. A quick
example off the top of my head is all the work that's going into model
import/export for Python: SPARK-11939


Additionally, according to the 2015 Spark Survey
,
58% of Spark users use the Python API, more than any other language save
for Scala (71%). (Users can select multiple languages on the survey.)
Python users were also the 3rd-fastest growing "demographic" for Spark,
after Windows and Spark Streaming users.

Any notion that Spark is going to "move away from Python" is completely
contradicted by the facts.

Nick


Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html


On Tue, Aug 9, 2016 at 5:14 PM abhishek singh  wrote:

>
>


Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html


On Tue, Aug 9, 2016 at 8:03 PM James Ding  wrote:

>
>


Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html


On Wed, Aug 10, 2016 at 2:46 AM Martin Somers  wrote:

>
>
> --
> M
>


Re: Unsubscribe

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html

On Tue, Aug 9, 2016 at 3:02 PM Hogancamp, Aaron <
aaron.t.hoganc...@leidos.com> wrote:

> Unsubscribe.
>
>
>
> Thanks,
>
>
>
> Aaron Hogancamp
>
> Data Scientist
>
>
>


Re: Unsubscribe.

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html

On Tue, Aug 9, 2016 at 3:05 PM Martin Somers  wrote:

> Unsubscribe.
>
> Thanks
> M
>


Re: Add column sum as new column in PySpark dataframe

2016-08-05 Thread Nicholas Chammas
I think this is what you need:

from functools import reduce
from operator import add

df.withColumn('total', reduce(add, [df[c] for c in df.columns]))
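
If nulls should count as zero, a minimal sketch of the same idea (assuming
every summed column is numeric):

from functools import reduce
from operator import add
import pyspark.sql.functions as sqlf

# coalesce each column with 0 so a single null does not null out the row total
df.withColumn('total', reduce(add, [sqlf.coalesce(df[c], sqlf.lit(0)) for c in df.columns]))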

Nic

On Thu, Aug 4, 2016 at 9:41 AM Javier Rey jre...@gmail.com
 wrote:

Hi everybody,
>
> Sorry, I sent the last message incomplete; this is the complete version:
>
> I'm using PySpark and I have a Spark dataframe with a bunch of numeric
> columns. I want to add a column that is the sum of all the other columns.
>
> Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
>
> df.withColumn('total_col', df.a + df.b + df.c)
>
> The problem is that I don't want to type out each column individually and
> add them, especially if I have a lot of columns. I want to be able to do
> this automatically or by specifying a list of column names that I want to
> add. Is there another way to do this?
>
> I find this solution:
>
> df.withColumn('total', sum(df[col] for col in df.columns))
>
> But I get this error:
>
> "AttributeError: 'generator' object has no attribute '_get_object_id"
>
> Additionally I want to sum only non-null values.
>
> Thanks in advance,
>
> Samir
>
​


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
No, SQLContext is not disappearing. The top-level class is replaced by
SparkSession, but you can always get the underlying context from the
session.

You can also use SparkSession.udf.register()
<http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.udf>,
which is just a wrapper for sqlContext.registerFunction
<https://github.com/apache/spark/blob/2182e4322da6ba732f99ae75dce00f76f1cdc4d9/python/pyspark/sql/context.py#L511-L520>
.
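
A minimal sketch against Spark 2.0, assuming a DataFrame df that holds the
adgroupid column and has been registered as a temp view named "df":

from pyspark.sql.types import IntegerType

def square_it(x):
    return x * x

# register the Python function so it can be called inside SQL strings
spark.udf.register("squareIt", square_it, IntegerType())

df.createOrReplaceTempView("df")
spark.sql("SELECT squareIt(adgroupid) AS function_result FROM df").show()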
​

On Thu, Aug 4, 2016 at 12:04 PM Ben Teeuwen  wrote:

> Yes, but I don’t want to use it in a select() call.
> Either selectExpr() or spark.sql(), with the udf being called inside a
> string.
>
> Now I got it to work using
> "sqlContext.registerFunction('encodeOneHot_udf',encodeOneHot, VectorUDT())”
> But this sqlContext approach will disappear, right? So I’m curious what to
> use instead.
>
> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas 
> wrote:
>
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM Ben Teeuwen wrote:
>
>> Hi,
>>
>> I’d like to use a UDF in pyspark 2.0. As in ..
>> 
>>
>> def squareIt(x):
>>   return x * x
>>
>> # register the function and define return type
>> ….
>>
>> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
>> function_result from df’)
>>
>> _
>>
>> How can I register the function? I only see registerFunction in the
>> deprecated sqlContext at
>> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
>> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new
>> way to go?
>>
>> Ben
>>
>
>


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
Have you looked at pyspark.sql.functions.udf and the associated examples?
On Thu, Aug 4, 2016 at 9:10 AM Ben Teeuwen wrote:

> Hi,
>
> I’d like to use a UDF in pyspark 2.0. As in ..
> 
>
> def squareIt(x):
>   return x * x
>
> # register the function and define return type
> ….
>
> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
> function_result from df’)
>
> _
>
> How can I register the function? I only see registerFunction in the
> deprecated sqlContext at
> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new
> way to go?
>
> Ben
>


Re: spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Nicholas Chammas
Yes, spark-ec2 has been removed from the main project, as called out in the
Release Notes:

http://spark.apache.org/releases/spark-release-2-0-0.html#removals

You can still discuss spark-ec2 here or on Stack Overflow, as before. Bug
reports and the like should now go on that AMPLab GitHub project as opposed
to JIRA, though.

You should use branch-2.0.
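
For example, a sketch of launching from the standalone repo (key pair name,
identity file, region, and cluster name are placeholders):

git clone https://github.com/amplab/spark-ec2.git --branch branch-2.0
cd spark-ec2
./spark-ec2 -k my-keypair -i /path/to/my-keypair.pem \
  --region=us-east-1 --slaves 3 \
  launch my-spark-cluster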

On Wed, Jul 27, 2016 at 2:30 PM Andy Davidson 
wrote:

> Congratulations on releasing 2.0!
>
>
> spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However,
> http://spark.apache.org/docs/latest/index.html has a link to the
> spark-ec2 github repo https://github.com/amplab/spark-ec2
>
>
> Is this the right group to discuss spark-ec2?
>
> Any idea how stable spark-ec2 is on spark-2.0?
>
> Should we use master or branch-2.0? It looks like the default might be the
> branch-1.6 ?
>
> Thanks
>
> Andy
>
>
> P.s. The new stand alone documentation is a big improvement. I have a
> much better idea of what spark-ec2 does and how to upgrade my system.
>
>
>
>
>
>
>
>
>
>
>
>


Re: Unsubscribe - 3rd time

2016-06-29 Thread Nicholas Chammas
> I'm not sure I've ever come across an email list that allows you to
unsubscribe by responding to the list with "unsubscribe".

Many noreply lists (e.g. companies sending marketing email) actually work
that way, which is probably what most people are used to these days.

What this list needs is an unsubscribe link in the footer, like most modern
mailing lists have. Work to add this in is already in progress here:
https://issues.apache.org/jira/browse/INFRA-12185

Nick

On Wed, Jun 29, 2016 at 12:57 PM Jonathan Kelly 
wrote:

> If at first you don't succeed, try, try again. But please don't. :)
>
> See the "unsubscribe" link here: http://spark.apache.org/community.html
>
> I'm not sure I've ever come across an email list that allows you to
> unsubscribe by responding to the list with "unsubscribe". At least, all of
> the Apache ones have a separate address to which you send
> subscribe/unsubscribe messages. And yet people try to send "unsubscribe"
> messages to the actual list almost every day.
>
> On Wed, Jun 29, 2016 at 9:03 AM Mich Talebzadeh 
> wrote:
>
>> LOL. Bravely said Joaquin.
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 June 2016 at 16:54, Joaquin Alzola 
>> wrote:
>>
>>> And 3rd time is not enough to know that unsubscribe is done through
>>> user-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>> *From:* Steve Florence [mailto:sflore...@ypm.com]
>>> *Sent:* 29 June 2016 16:47
>>> *To:* user@spark.apache.org
>>> *Subject:* Unsubscribe - 3rd time
>>>
>>>
>>>
>>>
>>> This email is confidential and may be subject to privilege. If you are
>>> not the intended recipient, please do not copy or disclose its content but
>>> contact the sender immediately upon receipt.
>>>
>>
>>


Re: Writing output of key-value Pair RDD

2016-05-04 Thread Nicholas Chammas
You're looking for this discussion:
http://stackoverflow.com/q/23995040/877069

Also, a simpler alternative with DataFrames:
https://github.com/apache/spark/pull/8375#issuecomment-202458325
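
A rough sketch of the DataFrame route (pairRDD, the column names, the bucket,
and the s3n/s3a scheme are placeholders; note this writes one directory per
key, named key=<value>, rather than one file named after the key):

# assuming a SQLContext is in scope so that toDF() is available on the RDD
df = pairRDD.toDF(["key", "value"])
df.write.partitionBy("key").json("s3n://my-bucket/output/")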

On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick 
wrote:

> Hi,
>
>
> Is there any way to write out to S3 the values of a key-value Pair RDD?
>
>
> I'd like each value of a pair to be written to its own file where the file
> name corresponds to the key name.
>
>
> Thanks,
>
> --
>
> Nick
>


Re: spark-ec2 hitting yum install issues

2016-04-14 Thread Nicholas Chammas
If you log into the cluster and manually try that step does it still fail?
Can you yum install anything else?

You might want to report this issue directly on the spark-ec2 repo, btw:
https://github.com/amplab/spark-ec2

Nick

On Thu, Apr 14, 2016 at 9:08 PM sanusha  wrote:

>
> I am using spark-1.6.1-prebuilt-with-hadoop-2.6 on mac. I am using the
> spark-ec2 to launch a cluster in
> Amazon VPC. The setup.sh script [run first thing on master after launch]
> uses pssh and tries to install it
> via 'yum install -y pssh'. This step always fails on the master AMI that
> the
> script uses by default as it is
> not able to find it in the repo mirrors - hits 403.
>
> Has anyone faced this and know what's causing it? For now, I have changed
> the script to not use pssh
> as a workaround. But would like to fix the root cause.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-hitting-yum-install-issues-tp26786.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Nicholas Chammas
Yes, this is a known issue. The core devs are already aware of it. [CC dev]

FWIW, I believe the Spark 1.6.1 / Hadoop 2.6 package on S3 is not corrupt.
It may be the only 1.6.1 package that is not corrupt, though. :/

Nick


On Tue, Apr 12, 2016 at 9:00 PM Augustus Hong 
wrote:

> Hi all,
>
> I'm trying to launch a cluster with the spark-ec2 script but seeing the
> error below.  Are the packages on S3 corrupted / not in the correct format?
>
> Initializing spark
>
> --2016-04-13 00:25:39--
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz
>
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.11.67
>
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.11.67|:80...
> connected.
>
> HTTP request sent, awaiting response... 200 OK
>
> Length: 277258240 (264M) [application/x-compressed]
>
> Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’
>
> 100%[==>]
> 277,258,240 37.6MB/s   in 9.2s
>
> 2016-04-13 00:25:49 (28.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved
> [277258240/277258240]
>
> Unpacking Spark
>
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> mv: missing destination file operand after `spark'
>
> Try `mv --help' for more information.
>
>
>
>
>
>
> --
> [image: Branch] 
> Augustus Hong
> Software Engineer
>
>


Re: Reading Back a Cached RDD

2016-03-24 Thread Nicholas Chammas
Isn’t persist() only for reusing an RDD within an active application? Maybe
checkpoint() is what you’re looking for instead?
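
A quick sketch of both routes (all paths are placeholders):

# within a single application: truncate lineage and materialize to reliable storage
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
rdd.checkpoint()
rdd.count()  # checkpointing happens the first time the RDD is actually computed

# to get the data back in a brand-new shell session, save it explicitly
rdd.saveAsObjectFile("hdfs:///data/my_rdd")
# later, in the new session:
# restored = sc.objectFile("hdfs:///data/my_rdd")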
​

On Thu, Mar 24, 2016 at 2:02 PM Afshartous, Nick 
wrote:

>
> Hi,
>
>
> After calling RDD.persist(), is then possible to come back later and
> access the persisted RDD.
>
> Let's say for instance coming back and starting a new Spark shell
> session.  How would one access the persisted RDD in the new shell session ?
>
>
> Thanks,
>
> --
>
>Nick
>


Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
We’re veering off from the original question of this thread, but to
clarify, my comment earlier was this:

So in short, DataFrames are the “new RDD”—i.e. the new base structure you
should be using in your Spark programs wherever possible.

RDDs are not going away, and clearly in your case DataFrames are not that
helpful, so sure, continue to use RDDs. There’s nothing wrong with that.
No-one is saying you *must* use DataFrames, and Spark will continue to
offer its RDD API.

However, my original comment to Jules still stands: If you can, use
DataFrames. In most cases they will offer you a better development
experience and better performance across languages, and future Spark
optimizations will mostly be enabled by the structure that DataFrames
provide.

DataFrames are the “new RDD” in the sense that they are the new foundation
for much of the new work that has been done in recent versions and that is
coming in Spark 2.0 and beyond.

Many people work with semi-structured data and have a relatively easy path
to DataFrames, as I explained in my previous email. If, however, you’re
working with data that has very little structure, like in Darren’s case,
then yes, DataFrames are probably not going to help that much. Stick with
RDDs and you’ll be fine.
​

On Wed, Mar 2, 2016 at 6:28 PM Darren Govoni  wrote:

> Our data is made up of single text documents scraped off the web. We store
> these in an RDD. A DataFrame or similar structure makes no sense at that
> point. And the RDD is transient.
>
> So my point is. Dataframes should not replace plain old rdd since rdds
> allow for more flexibility and sql etc is not even usable on our data while
> in rdd. So all those nice dataframe apis aren't usable until it's
> structured. Which is the core problem anyway.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Nicholas Chammas 
> Date: 03/02/2016 5:43 PM (GMT-05:00)
> To: Darren Govoni , Jules Damji ,
> Joshua Sorrell 
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
> features
>
> Plenty of people get their data in Parquet, Avro, or ORC files; or from a
> database; or do their initial loading of un- or semi-structured data using
> one of the various data source libraries
> <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help
> with type-/schema-inference.
>
> All of these paths help you get to a DataFrame very quickly.
>
> Nick
>
> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni  wrote:
>
> Dataframes are essentially structured tables with schemas. So where does
>> the non typed data sit before it becomes structured if not in a traditional
>> RDD?
>>
>> For us almost all the processing comes before there is structure to it.
>>
>>
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Nicholas Chammas 
>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>> To: Jules Damji , Joshua Sorrell 
>>
>> Cc: user@spark.apache.org
>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
>> features
>>
>> > However, I believe, investing (or having some members of your group)
>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>> the performance gain, especially now with Tungsten (not sure how it relates
>> to Python, but some other knowledgeable people on the list, please chime
>> in).
>>
>> The more your workload uses DataFrames, the less of a difference there
>> will be between the languages (Scala, Java, Python, or R) in terms of
>> performance.
>>
>> One of the main benefits of Catalyst (which DFs enable) is that it
>> automatically optimizes DataFrame operations, letting you focus on _what_
>> you want while Spark will take care of figuring out _how_.
>>
>> Tungsten takes things further by tightly managing memory using the type
>> information made available to it via DataFrames. This benefit comes into
>> play regardless of the language used.
>>
>> So in short, DataFrames are the "new RDD"--i.e. the new base structure
>> you should be using in your Spark programs wherever possible. And with
>> DataFrames, what language you use matters much less in terms of performance.
>>
>> Nick
>>
>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:
>>
>>> Hello Joshua,
>>>
>>> comments are inline...
>>>
>>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
>>>
>>> I haven't used Spark in the last year and a half. I am about to start a
>>> p

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
Plenty of people get their data in Parquet, Avro, or ORC files; or from a
database; or do their initial loading of un- or semi-structured data using
one of the various data source libraries
<http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help with
type-/schema-inference.

All of these paths help you get to a DataFrame very quickly.
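
For instance (paths are placeholders, 1.6-era API):

df_events = sqlContext.read.parquet("s3n://my-bucket/events/")  # schema is read from the Parquet files
df_logs = sqlContext.read.json("hdfs:///logs/*.json")           # schema is inferred by sampling the data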

Nick

On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni  wrote:

Dataframes are essentially structured tables with schemas. So where does
> the non typed data sit before it becomes structured if not in a traditional
> RDD?
>
> For us almost all the processing comes before there is structure to it.
>
>
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> ---- Original message 
> From: Nicholas Chammas 
> Date: 03/02/2016 5:13 PM (GMT-05:00)
> To: Jules Damji , Joshua Sorrell 
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
> features
>
> > However, I believe, investing (or having some members of your group)
> learn and invest in Scala is worthwhile for few reasons. One, you will get
> the performance gain, especially now with Tungsten (not sure how it relates
> to Python, but some other knowledgeable people on the list, please chime
> in).
>
> The more your workload uses DataFrames, the less of a difference there
> will be between the languages (Scala, Java, Python, or R) in terms of
> performance.
>
> One of the main benefits of Catalyst (which DFs enable) is that it
> automatically optimizes DataFrame operations, letting you focus on _what_
> you want while Spark will take care of figuring out _how_.
>
> Tungsten takes things further by tightly managing memory using the type
> information made available to it via DataFrames. This benefit comes into
> play regardless of the language used.
>
> So in short, DataFrames are the "new RDD"--i.e. the new base structure you
> should be using in your Spark programs wherever possible. And with
> DataFrames, what language you use matters much less in terms of performance.
>
> Nick
>
> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:
>
>> Hello Joshua,
>>
>> comments are inline...
>>
>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
>>
>> I haven't used Spark in the last year and a half. I am about to start a
>> project with a new team, and we need to decide whether to use pyspark or
>> Scala.
>>
>>
>> Indeed, good questions, and they do come up lot in trainings that I have
>> attended, where this inevitable question is raised.
>> I believe, it depends on your level of comfort zone or adventure into
>> newer things.
>>
>> True, for the most part that Apache Spark committers have been committed
>> to keep the APIs at parity across all the language offerings, even though
>> in some cases, in particular Python, they have lagged by a minor release.
>> To the the extent that they’re committed to level-parity is a good sign. It
>> might to be the case with some experimental APIs, where they lag behind,
>>  but for the most part, they have been admirably consistent.
>>
>> With Python there’s a minor performance hit, since there’s an extra level
>> of indirection in the architecture and an additional Python PID that the
>> executors launch to execute your pickled Python lambdas. Other than that it
>> boils down to your comfort zone. I recommend looking at Sameer’s slides on
>> (Advanced Spark for DevOps Training) where he walks through the pySpark and
>> Python architecture.
>>
>>
>> We are NOT a java shop. So some of the build tools/procedures will
>> require some learning overhead if we go the Scala route. What I want to
>> know is: is the Scala version of Spark still far enough ahead of pyspark to
>> be well worth any initial training overhead?
>>
>>
>> If you are a very advanced Python shop and if you’ve in-house libraries
>> that you have written in Python that don’t exist in Scala or some ML libs
>> that don’t exist in the Scala version and will require fair amount of
>> porting and gap is too large, then perhaps it makes sense to stay put with
>> Python.
>>
>> However, I believe, investing (or having some members of your group)
>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>> the performance gain, especially now with Tungsten (not sure how it relates
>> to Python, but some other knowledgeable people on the list, please chime
>> in). Two, since Spark is written in Scala, it gives you an enormous
>> advantage to read sources (which are well documented and highly read

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
> However, I believe, investing (or having some members of your group)
learn and invest in Scala is worthwhile for few reasons. One, you will get
the performance gain, especially now with Tungsten (not sure how it relates
to Python, but some other knowledgeable people on the list, please chime
in).

The more your workload uses DataFrames, the less of a difference there will
be between the languages (Scala, Java, Python, or R) in terms of
performance.

One of the main benefits of Catalyst (which DFs enable) is that it
automatically optimizes DataFrame operations, letting you focus on _what_
you want while Spark will take care of figuring out _how_.

Tungsten takes things further by tightly managing memory using the type
information made available to it via DataFrames. This benefit comes into
play regardless of the language used.

So in short, DataFrames are the "new RDD"--i.e. the new base structure you
should be using in your Spark programs wherever possible. And with
DataFrames, what language you use matters much less in terms of performance.

Nick

On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:

> Hello Joshua,
>
> comments are inline...
>
> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
>
> I haven't used Spark in the last year and a half. I am about to start a
> project with a new team, and we need to decide whether to use pyspark or
> Scala.
>
>
> Indeed, good questions, and they do come up lot in trainings that I have
> attended, where this inevitable question is raised.
> I believe, it depends on your level of comfort zone or adventure into
> newer things.
>
> True, for the most part that Apache Spark committers have been committed
> to keep the APIs at parity across all the language offerings, even though
> in some cases, in particular Python, they have lagged by a minor release.
> To the the extent that they’re committed to level-parity is a good sign. It
> might to be the case with some experimental APIs, where they lag behind,
>  but for the most part, they have been admirably consistent.
>
> With Python there’s a minor performance hit, since there’s an extra level
> of indirection in the architecture and an additional Python PID that the
> executors launch to execute your pickled Python lambdas. Other than that it
> boils down to your comfort zone. I recommend looking at Sameer’s slides on
> (Advanced Spark for DevOps Training) where he walks through the pySpark and
> Python architecture.
>
>
> We are NOT a java shop. So some of the build tools/procedures will require
> some learning overhead if we go the Scala route. What I want to know is: is
> the Scala version of Spark still far enough ahead of pyspark to be well
> worth any initial training overhead?
>
>
> If you are a very advanced Python shop and if you’ve in-house libraries
> that you have written in Python that don’t exist in Scala or some ML libs
> that don’t exist in the Scala version and will require fair amount of
> porting and gap is too large, then perhaps it makes sense to stay put with
> Python.
>
> However, I believe, investing (or having some members of your group) learn
> and invest in Scala is worthwhile for few reasons. One, you will get the
> performance gain, especially now with Tungsten (not sure how it relates to
> Python, but some other knowledgeable people on the list, please chime in).
> Two, since Spark is written in Scala, it gives you an enormous advantage to
> read sources (which are well documented and highly readable) should you
> have to consult or learn nuances of certain API method or action not
> covered comprehensively in the docs. And finally, there’s a long term
> benefit in learning Scala for reasons other than Spark. For example,
> writing other scalable and distributed applications.
>
>
> Particularly, we will be using Spark Streaming. I know a couple of years
> ago that practically forced the decision to use Scala.  Is this still the
> case?
>
>
> You’ll notice that certain APIs call are not available, at least for now,
> in Python.
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>
>
> Cheers
> Jules
>
> --
> The Best Ideas Are Simple
> Jules S. Damji
> e-mail:dmat...@comcast.net
> e-mail:jules.da...@gmail.com
>
>


Re: Is this likely to cause any problems?

2016-02-19 Thread Nicholas Chammas
The docs mention spark-ec2 because it is part of the Spark project. There
are many, many alternatives to spark-ec2 out there like EMR, but it's
probably not the place of the official docs to promote any one of those
third-party solutions.

On Fri, Feb 19, 2016 at 11:05 AM James Hammerton  wrote:

> Hi,
>
> Having looked at how easy it is to use EMR, I reckon you may be right,
> especially if using Java 8 is no more difficult with that than with
> spark-ec2 (where I had to install it on the master and slaves and edit the
> spark-env.sh).
>
> I'm now curious as to why the Spark documentation (
> http://spark.apache.org/docs/latest/index.html) mentions EC2 but not EMR.
>
> Regards,
>
> James
>
>
> On 19 February 2016 at 14:25, Daniel Siegmann  > wrote:
>
>> With EMR supporting Spark, I don't see much reason to use the spark-ec2
>> script unless it is important for you to be able to launch clusters using
>> the bleeding edge version of Spark. EMR does seem to do a pretty decent job
>> of keeping up to date - the latest version (4.3.0) supports the latest
>> Spark version (1.6.0).
>>
>> So I'd flip the question around and ask: is there any reason to continue
>> using the spark-ec2 script rather than EMR?
>>
>> On Thu, Feb 18, 2016 at 11:39 AM, James Hammerton  wrote:
>>
>>> I have now... So far  I think the issues I've had are not related to
>>> this, but I wanted to be sure in case it should be something that needs to
>>> be patched. I've had some jobs run successfully but this warning appears in
>>> the logs.
>>>
>>> Regards,
>>>
>>> James
>>>
>>> On 18 February 2016 at 12:23, Ted Yu  wrote:
>>>
 Have you seen this ?

 HADOOP-10988

 Cheers

 On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton 
 wrote:

> HI,
>
> I am seeing warnings like this in the logs when I run Spark jobs:
>
> OpenJDK 64-Bit Server VM warning: You have loaded library 
> /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have 
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c 
> ', or link it with '-z noexecstack'.
>
>
> I used spark-ec2 to launch the cluster with the default AMI, Spark
> 1.5.2, hadoop major version 2.4. I altered the jdk to be openjdk 8 as I'd
> written some jobs in Java 8. The 6 workers nodes are m4.2xlarge and master
> is m4.large.
>
> Could this contribute to any problems running the jobs?
>
> Regards,
>
> James
>


>>>
>>
>


Re: Is spark-ec2 going away?

2016-01-27 Thread Nicholas Chammas
I noticed that in the main branch, the ec2 directory along with the
spark-ec2 script is no longer present.

It’s been moved out of the main repo to its own location:
https://github.com/amplab/spark-ec2/pull/21

Is spark-ec2 going away in the next release? If so, what would be the best
alternative at that time?

It’s not going away. It’s just being removed from the main Spark repo and
maintained separately.

There are many alternatives like EMR, which was already mentioned, as well
as more full-service solutions like Databricks. It depends on what you’re
looking for.

If you want something as close to spark-ec2 as possible but more actively
developed, you might be interested in checking out Flintrock
, which I built.

Is there any way to add/remove additional workers while the cluster is
running without stopping/starting the EC2 cluster?

Not currently possible with spark-ec2 and a bit difficult to add. See:
https://issues.apache.org/jira/browse/SPARK-2008

For 1, if no such capability is provided with the current script., do we
have to write it ourselves? Or is there any plan in the future to add such
functions?

No "official" plans to add this to spark-ec2. It’s up to a contributor to
step up and implement this feature, basically. Otherwise it won’t happen.

Nick

On Wed, Jan 27, 2016 at 5:13 PM Alexander Pivovarov 
wrote:

you can use EMR-4.3.0 run on spot instances to control the price
>
> yes, you can add/remove instances to the cluster on fly  (CORE instances
> support add only, TASK instances - add and remove)
>
>
>
> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung  > wrote:
>
>> I noticed that in the main branch, the ec2 directory along with the
>> spark-ec2 script is no longer present.
>>
>> Is spark-ec2 going away in the next release? If so, what would be the
>> best alternative at that time?
>>
>> A couple more additional questions:
>> 1. Is there any way to add/remove additional workers while the cluster is
>> running without stopping/starting the EC2 cluster?
>> 2. For 1, if no such capability is provided with the current script., do
>> we have to write it ourselves? Or is there any plan in the future to add
>> such functions?
>> 2. In PySpark, is it possible to dynamically change driver/executor
>> memory, number of cores per executor without having to restart it? (e.g.
>> via changing sc configuration or recreating sc?)
>>
>> Our ideal scenario is to keep running PySpark (in our case, as a
>> notebook) and connect/disconnect to any spark clusters on demand.
>>
>
> ​


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python
installed since they run Python code in PySpark jobs natively.
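
If a newer interpreter is installed in the same location on every node, you
can point Spark at it, e.g. (the install path is a placeholder):

export PYSPARK_PYTHON=/opt/python2.7/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/python2.7/bin/python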

On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL <https://docs.python.org/3/license.html>:
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>>> wrote:
>>>
>>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>>> imagine that they're also capable of installing a standalone Python
>>>> alongside that Spark version (without changing Python systemwide). For
>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>>> require any special permissions to install (you don't need root / sudo
>>>> access). Does this address the Python versioning concerns for RHEL users?
>>>>
>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
>>>> wrote:
>>>>
>>>>> yeah, the practical concern is that we have no control over java or
>>>>> python version on large company clusters. our current reality for the vast
>>>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>>>
>>>>> i dont like it either, but i cannot change it.
>>>>>
>>>>> we currently don't use pyspark so i have no stake in this, but if we
>>>>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>>>> dropped. no point in developing something that doesnt run for majority of
>>>>> customers.
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>>>> until 2020. So I'm assuming these large companies will have the option of
>>>>>> riding out Python 2.6 until then.
>>>>>>
>>>>>> Are we seriously saying that Spark should likewise support Python 2.6
>>>>>> for the next several years? Even though the core Python devs stopped
>>>>>> supporting it in 2013?
>>>>>>
>>>>>> If that's not what we're suggesting, then when, roughly, can we drop
>>>>>> support? What are the criteria?
>>>>>>
>>>>>> I understand the practical concern here. If companies are stuck using
>>>>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>>>>> concern against the maintenance burden on this project, I would say that
>>>>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>>>>>> to
>>>>>> take. There are many tiny annoyances one has to put up with to 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed

Not to nitpick, but maybe this is important. The Python license is
GPL-compatible
but not GPL <https://docs.python.org/3/license.html>:

Note GPL-compatible doesn’t mean that we’re distributing Python under the
GPL. All Python licenses, unlike the GPL, let you distribute a modified
version without making your changes open source. The GPL-compatible
licenses make it possible to combine Python with other software that is
released under the GPL; the others don’t.

Nick
​

On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed, so the
> client would have to download it and install it themselves, and this would
> mean its an independent install which has to be audited and approved and
> now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>> until 2020. So I'm assuming these large companies will have the option of
>>>> riding out Python 2.6 until then.
>>>>
>>>> Are we seriously saying that Spark should likewise support Python 2.6
>>>> for the next several years? Even though the core Python devs stopped
>>>> supporting it in 2013?
>>>>
>>>> If that's not what we're suggesting, then when, roughly, can we drop
>>>> support? What are the criteria?
>>>>
>>>> I understand the practical concern here. If companies are stuck using
>>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>>> concern against the maintenance burden on this project, I would say that
>>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>>>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>>>
>>>> I suppose if our main PySpark contributors are fine putting up with
>>>> those annoyances, then maybe we don't need to drop support just yet...
>>>>
>>>> Nick
>>>> On Tue, Jan 5, 2016 at 2:27 PM Julio Antonio Soto de Vicente wrote:
>>>>
>>>>> Unfortunately, Koert is right.
>>>>>
>>>>> I've been in a couple of projects using Spark (banking industry) where
>>>>> CentOS + Python 2.6 is the toolbox available.
>>>>>
>>>>> That said, I believe it should not be a concern for Spark. Python 2.6
>>>>> is old and busted, which is totally opposite to the Spark philosophy IMO.
>>>>>
>>>>>
>>>>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>>>>> escribió:
>>>>>
>>>>> rhel/centos 6 ships with python 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until
2020. So I'm assuming these large companies will have the option of riding
out Python 2.6 until then.

Are we seriously saying that Spark should likewise support Python 2.6 for
the next several years? Even though the core Python devs stopped supporting
it in 2013?

If that's not what we're suggesting, then when, roughly, can we drop
support? What are the criteria?

I understand the practical concern here. If companies are stuck using 2.6,
it doesn't matter to them that it is deprecated. But balancing that concern
against the maintenance burden on this project, I would say that "upgrade
to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
There are many tiny annoyances one has to put up with to support 2.6.

I suppose if our main PySpark contributors are fine putting up with those
annoyances, then maybe we don't need to drop support just yet...

Nick
On Tue, Jan 5, 2016 at 2:27 PM Julio Antonio Soto de Vicente wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6 is
> old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the only
> option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland  > wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>>> reason to continue support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>>> currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>>> wrote:
>>>
>>>> plus 1,
>>>>
>>>> we are currently using python 2.7.2 in production environment.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>>>
>>>> +1
>>>> We use Python 2.7
>>>>
>>>> Regards,
>>>>
>>>> Meethu Mathew
>>>>
>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>>>> wrote:
>>>>
>>>>> Does anybody here care about us dropping support for Python 2.6 in
>>>>> Spark 2.0?
>>>>>
>>>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>>>> parsing) when compared with Python 2.7. Some libraries that Spark depend 
>>>>> on
>>>>> stopped supporting 2.6. We can still convince the library maintainers to
>>>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>>>> Python 2.6 to run Spark.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1

Red Hat supports Python 2.6 on RHEL 5 until 2020
, but
otherwise yes, Python 2.6 is ancient history and the core Python developers
stopped supporting it in 2013. RHEL 5 is not a good enough reason to
continue support for Python 2.6 IMO.

We should aim to support Python 2.7 and Python 3.3+ (which I believe we
currently do).

Nick

On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in Spark
>> 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>


Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Nicholas Chammas
Quick question: Are you processing gzipped files by any chance? It's a
common stumbling block people hit.

See: http://stackoverflow.com/q/27531816/877069
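
Gzipped files are not splittable, so each .gz file lands in a single partition
on a single worker. A common workaround is to repartition right after loading
(path and partition count are placeholders):

rdd = sc.textFile("s3n://my-bucket/big-input.json.gz").repartition(64)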

Nick

On Fri, Dec 4, 2015 at 2:28 PM Kyohey Hamaguchi 
wrote:

> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> available.)
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Adding more slaves to a running cluster

2015-11-25 Thread Nicholas Chammas
spark-ec2 does not directly support adding instances to an existing
cluster, apart from the special case of adding slaves to a cluster with a
master but no slaves. There is an open issue to track adding this support,
SPARK-2008 , but it
doesn't have any momentum at the moment.

Your best bet currently is to do what you did and hack your way through
using spark-ec2's various scripts.

You probably already know this, but to be clear, note that Spark itself
supports adding slaves to a running cluster. It's just that spark-ec2
hasn't implemented a feature to do this work for you.
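
For the standalone-mode part, a new worker can be pointed at the running
master by hand, e.g. (host name and install path are placeholders; 7077 is the
default master port):

# run on the new node
/root/spark/sbin/start-slave.sh spark://my-master-hostname:7077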

Nick

On Wed, Nov 25, 2015 at 2:27 PM Dillian Murphey 
wrote:

> It appears start-slave.sh works on a running cluster.  I'm surprised I
> can't find more info on this. Maybe I'm not looking hard enough?
>
> Using AWS and spot instances is incredibly more efficient, which begs for
> the need of dynamically adding more nodes while the cluster is up, yet
> everything I've found so far seems to indicate it isn't supported yet.
>
> But yet here I am with 1.5 and it at least appears to be working. Am I
> missing something?
>
> On Tue, Nov 24, 2015 at 4:40 PM, Dillian Murphey 
> wrote:
>
>> What's the current status on adding slaves to a running cluster?  I want
>> to leverage spark-ec2 and autoscaling groups.  I want to launch slaves as
>> spot instances when I need to do some heavy lifting, but I don't want to
>> bring down my cluster in order to add nodes.
>>
>> Can this be done by just running start-slave.sh??
>>
>> What about using Mesos?
>>
>> I just want to create an AMI for a slave and on some trigger launch it
>> and have it automatically add itself to the cluster.
>>
>> thanks
>>
>
>


Re: spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-23 Thread Nicholas Chammas
Don't the Hadoop builds include Hive already? Like
spark-1.5.2-bin-hadoop2.6.tgz?

On Mon, Nov 23, 2015 at 7:49 PM Jeff Schecter  wrote:

> Hi all,
>
> As far as I can tell, the bundled spark-ec2 script provides no way to
> launch a cluster running Spark 1.5.2 pre-built with HIVE.
>
> That is to say, all of the pre-built versions of Spark 1.5.2 in the S3 bucket
> spark-related-packages are missing HIVE.
>
> aws s3 ls s3://spark-related-packages/ | grep 1.5.2
>
>
> Am I missing something here? I'd rather avoid resorting to whipping up
> hacky patching scripts that might break with the next Spark point release
> if at all possible.
>


Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Nicholas Chammas
spark-ec2 does not offer a way to upgrade an existing cluster, and from
what I gather, it wasn't intended to be used to manage long-lasting
infrastructure. The recommended approach really is to just destroy your
existing cluster and launch a new one with the desired configuration.

If you want to upgrade the cluster in place, you'll probably have to do
that manually. Otherwise, perhaps spark-ec2 is not the right tool, and
instead you want one of those "grown-up" management tools like Ansible
which can be setup to allow in-place upgrades. That'll take a bit of work,
though.

Nick

On Wed, Nov 11, 2015 at 6:01 PM Augustus Hong 
wrote:

> Hey All,
>
> I have a Spark cluster(running version 1.5.0) on EC2 launched with the
> provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same
> cluster, what's the safest / recommended way to do that?
>
>
> I know I can spin up a new cluster running 1.5.2, but it doesn't seem
> efficient to spin up a new cluster every time we need to upgrade.
>
>
> Thanks,
> Augustus
>
>
>
>
>
> --
> [image: Branch Metrics mobile deep linking] * Augustus
> Hong*
>  Data Analytics | Branch Metrics
>  m 650-391-3369 | e augus...@branch.io
>


Re: Spark EC2 script on Large clusters

2015-11-05 Thread Nicholas Chammas
Yeah, as Shivaram mentioned, this issue is well-known. It's documented in
SPARK-5189  and a bunch
of related issues. Unfortunately, it's hard to resolve this issue in
spark-ec2 without rewriting large parts of the project. But if you take a
crack at it and succeed I'm sure a lot of people will be happy.

I've started a separate project  --
which Shivaram also mentioned -- which aims to solve the problem of long
launch times and other issues
 with spark-ec2. It's
still very young and lacks several critical features, but we are making
steady progress.

Nick

On Thu, Nov 5, 2015 at 12:30 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> It is a known limitation that spark-ec2 is very slow for large
> clusters and as you mention most of this is due to the use of rsync to
> transfer things from the master to all the slaves.
>
> Nick cc'd has been working on an alternative approach at
> https://github.com/nchammas/flintrock that is more scalable.
>
> Thanks
> Shivaram
>
> On Thu, Nov 5, 2015 at 8:12 AM, Christian  wrote:
> > For starters, thanks for the awesome product!
> >
> > When creating ec2-clusters of 20-40 nodes, things work great. When we
> create
> > a cluster with the provided spark-ec2 script, it takes hours. When
> creating
> > a 200 node cluster, it takes 2 1/2 hours and for a 500 node cluster it
> takes
> > over 5 hours. One other problem we are having is that some nodes don't
> come
> > up when the other ones do, the process seems to just move on, skipping
> the
> > rsync and any installs on those ones.
> >
> > My guess as to why it takes so long to set up a large cluster is because
> of
> > the use of rsync. What if instead of using rsync, you synched to s3 and
> then
> > did a pdsh to pull it down on all of the machines. This is a big deal
> for us
> > and if we can come up with a good plan, we might be able help out with
> the
> > required changes.
> >
> > Are there any suggestions on how to deal with some of the nodes not being
> > ready when the process starts?
> >
> > Thanks for your time,
> > Christian
> >
>


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Nicholas Chammas
Nabble is an unofficial archive of this mailing list. I don't know who runs
it, but it's not Apache. There are often delays between when things get
posted to the list and updated on Nabble, and sometimes things never make
it over for whatever reason.

This mailing list is, I agree, very 1980s. Unfortunately, it's required by
the Apache Software Foundation (ASF).

There was a discussion earlier this year about migrating to Discourse that
explained why we're stuck with what we have for now. Ironically, that
discussion is hard to follow on the Apache archives (which is precisely one of
the motivations for proposing to migrate to Discourse), but there is a more
readable archive on another unofficial site.

Nick

On Sat, Oct 31, 2015 at 12:20 PM Martin Senne 
wrote:

> Having written a post on last Tuesday, I'm still not able to see my post
> under nabble. And yeah, subscription to u...@apache.spark.org was
> successful (rechecked a minute ago)
>
> Even more, I have no way (and no confirmation) that my post was accepted,
> rejected, whatever.
>
> This is very L4M3 and so 80ies.
>
> Any help appreciated. Thx!
>


Can we add an unsubscribe link in the footer of every email?

2015-10-21 Thread Nicholas Chammas
Every week or so someone emails the list asking to unsubscribe.

Of course, that's not the right way to do it. You're supposed to email a
different address than this one to unsubscribe, yet this is not in-your-face
obvious, so many people miss it.
And someone steps up almost every time to point people in the right
direction.

The vast majority of mailing lists I'm familiar with include a small footer
at the bottom of each email with a link to unsubscribe. I think this is
what most people expect, and it's where they check first.

Can we add a footer like that?

I think it would cut down on the weekly emails from people wanting to
unsubscribe, and it would match existing mailing list conventions elsewhere.

Nick


Re: stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread Nicholas Chammas
The Spark 1.4 release notes say that
Python 3 is supported. The 1.4 docs are incorrect, and the 1.5 programming
guide has been updated to indicate Python 3 support.

On Wed, Oct 14, 2015 at 7:06 AM shoira.mukhsin...@bnpparibasfortis.com <
shoira.mukhsin...@bnpparibasfortis.com> wrote:

> Dear Spark Community,
>
>
>
> The official documentation of Spark 1.4.1 mentions that Spark runs on Python
> 2.6+ http://spark.apache.org/docs/1.4.1/
>
> It is not clear whether by “Python 2.6+” you also mean Python 3.4 or not.
>
>
>
> There is a resolved issue on this point which makes me believe that it
> does run on Python 3.4: https://issues.apache.org/jira/i#browse/SPARK-9705
>
> Maybe the documentation is simply not up to date ? The programming guide
> mentions that it does not work for Python 3:
> https://spark.apache.org/docs/1.4.1/programming-guide.html
>
>
>
> Do you confirm that Spark 1.4.1 does run on Python3.4?
>
>
>
> Thanks in advance for your reaction!
>
>
>
> Regards,
>
> Shoira
>
>
>
>
>
>
>
> ==
> BNP Paribas Fortis disclaimer:
> http://www.bnpparibasfortis.com/e-mail-disclaimer.html
>
> BNP Paribas Fortis privacy policy:
> http://www.bnpparibasfortis.com/privacy-policy.html
>
> ==
>


Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
Hi Everybody!

Thanks for participating in the spark-ec2 survey. The full results are
publicly viewable here:

https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics

The gist of the results is as follows:

Most people found spark-ec2 useful as an easy way to get a working Spark
cluster to run a quick experiment or do some benchmarking without having to
do a lot of manual configuration or setup work.

Many people lamented the slow launch times of spark-ec2, problems getting
it to launch clusters within a VPC, and broken Ganglia installs. Some also
mentioned that Hadoop 2 didn't work as expected.

Wish list items for spark-ec2 included faster launches, selectable Hadoop 2
versions, and more configuration options.

If you'd like to add your own feedback to what's already there, I've
decided to leave the survey open for a few more days:

http://goo.gl/forms/erct2s6KRR

As noted before, your results are anonymous and public.

Thanks again for participating! I hope this has been useful to the
community.

Nick

On Tue, Aug 25, 2015 at 1:31 PM Nicholas Chammas 
wrote:

> Final chance to fill out the survey!
>
> http://goo.gl/forms/erct2s6KRR
>
> I'm gonna close it to new responses tonight and send out a summary of the
> results.
>
> Nick
>
> On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I'm planning to close the survey to further responses early next week.
>>
>> If you haven't chimed in yet, the link to the survey is here:
>>
>> http://goo.gl/forms/erct2s6KRR
>>
>> We already have some great responses, which you can view. I'll share a
>> summary after the survey is closed.
>>
>> Cheers!
>>
>> Nick
>>
>>
>> On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Howdy folks!
>>>
>>> I’m interested in hearing about what people think of spark-ec2
>>> <http://spark.apache.org/docs/latest/ec2-scripts.html> outside of the
>>> formal JIRA process. Your answers will all be anonymous and public.
>>>
>>> If the embedded form below doesn’t work for you, you can use this link
>>> to get the same survey:
>>>
>>> http://goo.gl/forms/erct2s6KRR
>>>
>>> Cheers!
>>> Nick
>>> ​
>>>
>>


Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey!

http://goo.gl/forms/erct2s6KRR

I'm gonna close it to new responses tonight and send out a summary of the
results.

Nick

On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas 
wrote:

> I'm planning to close the survey to further responses early next week.
>
> If you haven't chimed in yet, the link to the survey is here:
>
> http://goo.gl/forms/erct2s6KRR
>
> We already have some great responses, which you can view. I'll share a
> summary after the survey is closed.
>
> Cheers!
>
> Nick
>
>
> On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Howdy folks!
>>
>> I’m interested in hearing about what people think of spark-ec2
>> <http://spark.apache.org/docs/latest/ec2-scripts.html> outside of the
>> formal JIRA process. Your answers will all be anonymous and public.
>>
>> If the embedded form below doesn’t work for you, you can use this link to
>> get the same survey:
>>
>> http://goo.gl/forms/erct2s6KRR
>>
>> Cheers!
>> Nick
>> ​
>>
>


Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-20 Thread Nicholas Chammas
I'm planning to close the survey to further responses early next week.

If you haven't chimed in yet, the link to the survey is here:

http://goo.gl/forms/erct2s6KRR

We already have some great responses, which you can view. I'll share a
summary after the survey is closed.

Cheers!

Nick


On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Howdy folks!
>
> I’m interested in hearing about what people think of spark-ec2
> <http://spark.apache.org/docs/latest/ec2-scripts.html> outside of the
> formal JIRA process. Your answers will all be anonymous and public.
>
> If the embedded form below doesn’t work for you, you can use this link to
> get the same survey:
>
> http://goo.gl/forms/erct2s6KRR
>
> Cheers!
> Nick
> ​
>


[survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Nicholas Chammas
Howdy folks!

I’m interested in hearing about what people think of spark-ec2 outside of the
formal JIRA process. Your answers will all be anonymous and public.

If the embedded form below doesn’t work for you, you can use this link to
get the same survey:

http://goo.gl/forms/erct2s6KRR

Cheers!
Nick
​


Re: spark spark-ec2 credentials using aws_security_token

2015-07-27 Thread Nicholas Chammas
You refer to `aws_security_token`, but I'm not sure where you're specifying
it. Can you elaborate? Is it an environment variable?

On Mon, Jul 27, 2015 at 4:21 AM Jan Zikeš  wrote:

> Hi,
>
> I would like to ask if it is currently possible to use spark-ec2 script
> together with credentials that are consisting not only from:
> aws_access_key_id and aws_secret_access_key, but it also contains
> aws_security_token.
>
> When I try to run the script I am getting following error message:
>
> ERROR:boto:Caught exception reading instance data
> Traceback (most recent call last):
>   File "/Users/zikes/opensource/spark/ec2/lib/boto-2.34.0/boto/utils.py",
> line 210, in retry_url
> r = opener.open(req, timeout=timeout)
>   File
>
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
> line 404, in open
> response = self._open(req, data)
>   File
>
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
> line 422, in _open
> '_open', req)
>   File
>
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
> line 382, in _call_chain
> result = func(*args)
>   File
>
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
> line 1214, in http_open
> return self.do_open(httplib.HTTPConnection, req)
>   File
>
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
> line 1184, in do_open
> raise URLError(err)
> URLError: 
> ERROR:boto:Unable to read instance data, giving up
> No handler was ready to authenticate. 1 handlers were checked.
> ['QuerySignatureV2AuthHandler'] Check your credentials
>
> Does anyone has some idea what can be possibly wrong? Is aws_security_token
> the problem?
> I know that it seems more like a boto problem, but still I would like to
> ask
> if anybody has some experience with this?
>
> My launch command is:
> ./spark-ec2 -k my_key -i my_key.pem --additional-tags
> "mytag:tag1,mytag2:tag2" --instance-profile-name "profile1" -s 1 launch
> test
>
> Thank you in advance for any help.
> Best regards,
>
> Jan
>
> Note:
> I have also asked at
>
> http://stackoverflow.com/questions/31583513/spark-spark-ec2-credentials-using-aws-security-token?noredirect=1#comment51151822_31583513
> without any success.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-spark-ec2-credentials-using-aws-security-token-tp24007.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark ec2 as non-root / any plan to improve that in the future ?

2015-07-09 Thread Nicholas Chammas
No plans to change that at the moment, but agreed it is against accepted
convention. It would be a lot of work to change the tool, change the AMIs,
and test everything. My suggestion is not to hold your breath for such a
change.

spark-ec2, as far as I understand, is not intended for spinning up
permanent or production infrastructure (though people may use it for those
purposes), so there isn't a big impetus to fix this kind of issue. It works
really well for what it was intended for: spinning up clusters for testing,
prototyping, and experimenting.

Nick

On Thu, Jul 9, 2015 at 3:25 AM matd  wrote:

> Hi,
>
> Spark ec2 scripts are useful, but they install everything as root.
> AFAIK, it's not a good practice ;-)
>
> Why is it so ?
> Should these scripts be reserved for test/demo purposes, and not to be used
> for
> a production system ?
> Is it planned in some roadmap to improve that, or to replace ec2-scripts
> with something else ?
>
> Would it be difficult to change them to use a sudo-er instead ?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-as-non-root-any-plan-to-improve-that-in-the-future-tp23734.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
I would test it against 1.3 to be sure, because it could -- though unlikely
-- be a regression. For example, I recently stumbled upon this issue
<https://issues.apache.org/jira/browse/SPARK-8670> which was specific to
1.4.

On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl  wrote:

> I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
> code would be failing right now.
>
> On Saturday, June 27, 2015, Nicholas Chammas 
> wrote:
>
>> Yeah, you shouldn't have to rename the columns before joining them.
>>
>> Do you see the same behavior on 1.3 vs 1.4?
>>
>> Nick
>> On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl wrote:
>>
>>> still feels like a bug to have to create unique names before a join.
>>>
>>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha  wrote:
>>>
>>>> You can declare the schema with unique names before creation of df.
>>>> On 27 Jun 2015 13:01, "Axel Dahl"  wrote:
>>>>
>>>>>
>>>>> I have the following code:
>>>>>
>>>>> from pyspark import SQLContext
>>>>>
>>>>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
>>>>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
>>>>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
>>>>> {'name':'alice', 'country': 'ire', 'colour':'green'}]
>>>>>
>>>>> r1 = sc.parallelize(d1)
>>>>> r2 = sc.parallelize(d2)
>>>>>
>>>>> sqlContext = SQLContext(sc)
>>>>> df1 = sqlContext.createDataFrame(d1)
>>>>> df2 = sqlContext.createDataFrame(d2)
>>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>>>> 'left_outer').collect()
>>>>>
>>>>>
>>>>> When I run it I get the following, (notice in the first row, all join
>>>>> keys are taken from the right-side and so are blanked out):
>>>>>
>>>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>>>> name=None),
>>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>>>> name=u'bob'),
>>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>> country=u'ire', name=u'alice')]
>>>>>
>>>>> I would expect to get (though ideally without duplicate columns):
>>>>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
>>>>> name=None),
>>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>>>> name=u'bob'),
>>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>> country=u'ire', name=u'alice')]
>>>>>
>>>>> The workaround for now is this rather clunky piece of code:
>>>>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
>>>>> 'name2').withColumnRenamed('country', 'country2')
>>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>>>> 'left_outer').collect()
>>>>>
>>>>> So to me it looks like a bug, but am I doing something wrong?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Axel
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>


Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
Yeah, you shouldn't have to rename the columns before joining them.

Do you see the same behavior on 1.3 vs 1.4?
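
For what it's worth, here is a minimal sketch of how I would express that join
against the original columns, without renaming anything. Note the & with
parentheses: Python's `and` cannot be overloaded for Column expressions, so
the two equality conditions have to be combined with &. The app name below is
just a placeholder.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="join-sketch")  # placeholder app name
sqlContext = SQLContext(sc)

d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
      {'name': 'alice', 'country': 'jpn', 'age': 2},
      {'name': 'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
      {'name': 'alice', 'country': 'ire', 'colour': 'green'}]

df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)

# Combine the two equality conditions with & (not Python's `and`).
joined = df1.join(
    df2,
    (df1.name == df2.name) & (df1.country == df2.country),
    'left_outer')

print(joined.collect())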

Nick
On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl wrote:

> still feels like a bug to have to create unique names before a join.
>
> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha  wrote:
>
>> You can declare the schema with unique names before creation of df.
>> On 27 Jun 2015 13:01, "Axel Dahl"  wrote:
>>
>>>
>>> I have the following code:
>>>
>>> from pyspark import SQLContext
>>>
>>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
>>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
>>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice',
>>> 'country': 'ire', 'colour':'green'}]
>>>
>>> r1 = sc.parallelize(d1)
>>> r2 = sc.parallelize(d2)
>>>
>>> sqlContext = SQLContext(sc)
>>> df1 = sqlContext.createDataFrame(d1)
>>> df2 = sqlContext.createDataFrame(d2)
>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>> 'left_outer').collect()
>>>
>>>
>>> When I run it I get the following, (notice in the first row, all join
>>> keys are taken from the right-side and so are blanked out):
>>>
>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>> name=None),
>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>> name=u'bob'),
>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>> country=u'ire', name=u'alice')]
>>>
>>> I would expect to get (though ideally without duplicate columns):
>>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
>>> name=None),
>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
>>> name=u'bob'),
>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>> country=u'ire', name=u'alice')]
>>>
>>> The workaround for now is this rather clunky piece of code:
>>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
>>> 'name2').withColumnRenamed('country', 'country2')
>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>> 'left_outer').collect()
>>>
>>> So to me it looks like a bug, but am I doing something wrong?
>>>
>>> Thanks,
>>>
>>> -Axel
>>>
>>>
>>>
>>>
>>>
>


Re: Required settings for permanent HDFS Spark on EC2

2015-06-05 Thread Nicholas Chammas
If your problem is that stopping/starting the cluster resets configs, then
you may be running into this issue:

https://issues.apache.org/jira/browse/SPARK-4977

Nick

On Thu, Jun 4, 2015 at 2:46 PM barmaley  wrote:

> Hi - I'm having similar problem with switching from ephemeral to persistent
> HDFS - it always looks for 9000 port regardless of options I set for 9010
> persistent HDFS. Have you figured out a solution? Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Required-settings-for-permanent-HDFS-Spark-on-EC2-tp22860p23157.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-20 Thread Nicholas Chammas
To put this on the devs' radar, I suggest creating a JIRA for it (and
checking first if one already exists).

issues.apache.org/jira/
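
As an aside, the coarse-grained cap Matei mentions below is just a
configuration setting. A minimal sketch of setting it from PySpark (the app
name and the value of 32 are placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("capped-app")            # placeholder
        .set("spark.cores.max", "32"))       # max total cores this app will grab
sc = SparkContext(conf=conf)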

Nick

On Tue, May 19, 2015 at 1:34 PM Matei Zaharia 
wrote:

> Yeah, this definitely seems useful there. There might also be some ways to
> cap the application in Mesos, but I'm not sure.
>
> Matei
>
> On May 19, 2015, at 1:11 PM, Thomas Dudziak  wrote:
>
> I'm using fine-grained for a multi-tenant environment which is why I would
> welcome the limit of tasks per job :)
>
> cheers,
> Tom
>
> On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia 
> wrote:
>
>> Hey Tom,
>>
>> Are you using the fine-grained or coarse-grained scheduler? For the
>> coarse-grained scheduler, there is a spark.cores.max config setting that
>> will limit the total # of cores it grabs. This was there in earlier
>> versions too.
>>
>> Matei
>>
>> > On May 19, 2015, at 12:39 PM, Thomas Dudziak  wrote:
>> >
>> > I read the other day that there will be a fair number of improvements
>> in 1.4 for Mesos. Could I ask for one more (if it isn't already in there):
>> a configurable limit for the number of tasks for jobs run on Mesos ? This
>> would be a very simple yet effective way to prevent a job dominating the
>> cluster.
>> >
>> > cheers,
>> > Tom
>> >
>>
>>
>
>


Re: Virtualenv pyspark

2015-05-08 Thread Nicholas Chammas
This is an interesting question. I don't have a solution for you, but you
may be interested in taking a look at Anaconda Cluster.

It's made by the same people behind Conda (an alternative to pip focused on
data science packages) and may offer a better way of doing this. Haven't
used it though.
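
For what it's worth, the approach you describe below could be consolidated
into something like the rough sketch here. The tarball name, the paths, and
the idea of extracting the archive on the executor are all assumptions on my
part, so treat it as illustrative only:

import os
import tarfile
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="venv-sketch")          # placeholder
sc.addFile("/path/to/my_venv.tar.gz")             # shipped to every executor

def run_in_venv(iterator):
    # Runs on the executor: unpack and activate the shipped virtualenv
    # before importing anything that should come from it.
    # (Concurrent extraction by tasks sharing an executor is ignored here.)
    tar_path = SparkFiles.get("my_venv.tar.gz")
    root = os.path.dirname(tar_path)
    venv_dir = os.path.join(root, "my_venv")
    if not os.path.isdir(venv_dir):
        with tarfile.open(tar_path) as tar:
            tar.extractall(root)
    activate = os.path.join(venv_dir, "bin", "activate_this.py")
    execfile(activate, dict(__file__=activate))   # Python 2
    import numpy  # subject to the early-import caveat you mention below
    return [numpy.__version__ for _ in iterator]

print(sc.parallelize(range(4), 2).mapPartitions(run_in_venv).collect())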

On Thu, May 7, 2015 at 5:20 PM alemagnani  wrote:

> I am currently using pyspark with a virtualenv.
> Unfortunately I don't have access to the nodes file system and therefore I
> cannot  manually copy the virtual env over there.
>
> I have been using this technique:
>
> I first add a tar ball with the venv
> sc.addFile(virtual_env_tarball_file)
>
> Then in the code used on the node to do the computation I activate the venv
> like this:
> venv_location = SparkFiles.get(venv_name)
> activate_env="%s/bin/activate_this.py" % venv_location
> execfile(activate_env, dict(__file__=activate_env))
>
> Is there a better way to do this?
> One of the problem with this approach is that in
> spark/python/pyspark/statcounter.py numpy is imported
> before the venv is activated and this can cause conflicts with the venv
> numpy.
>
> Moreover this requires the venv to be sent around in the cluster all the
> time.
> Any suggestions?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Virtualenv-pyspark-tp22803.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to deploy self-build spark source code on EC2

2015-04-28 Thread Nicholas Chammas
[-dev] [+user]

This is a question for the user list, not the dev list.

Use the --spark-version and --spark-git-repo options to specify your own
repo and hash to deploy.

Source code link.


Nick

On Tue, Apr 28, 2015 at 12:14 PM Bo Fu b...@uchicago.edu
 wrote:

Hi all,
>
> I have an issue. I added some timestamps in Spark source code and built it
> using:
>
> mvn package -DskipTests
>
> I checked the new version in my own computer and it works. However, when I
> ran spark on EC2, the spark code EC2 machines ran is the original version.
>
> Anyone knows how to deploy the changed spark source code into EC2?
> Thx a lot
>
>
> Bo Fu
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>  ​


Re: Querying Cluster State

2015-04-26 Thread Nicholas Chammas
The Spark web UI offers a JSON interface with some of this information.

http://stackoverflow.com/a/29659630/877069

It's not an official API, so be warned that it may change unexpectedly
between versions, but you might find it helpful.
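
If it helps, hitting that JSON view is just an HTTP GET against the standalone
master's web UI. A rough sketch -- the port 8080 and the "workers"/"state"/
"status" field names are my recollection of that unofficial interface, so
double-check them against your version:

import json
import urllib2

def cluster_state(master_host, port=8080):
    raw = urllib2.urlopen("http://%s:%d/json" % (master_host, port)).read()
    info = json.loads(raw)
    alive = [w for w in info.get("workers", []) if w.get("state") == "ALIVE"]
    return len(alive), info.get("status")

alive_workers, master_status = cluster_state("my-master-host")  # placeholder host
print("alive workers: %d, master status: %s" % (alive_workers, master_status))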

Nick

On Sun, Apr 26, 2015 at 9:46 AM michal.klo...@gmail.com <
michal.klo...@gmail.com> wrote:

> Not sure if there's a spark native way but we've been using consul for
> this.
>
> M
>
>
>
> On Apr 26, 2015, at 5:17 AM, James King  wrote:
>
> Thanks for the response.
>
> But no this does not answer the question.
>
> The question was: Is there a way (via some API call) to query the number
> and type of daemons currently running in the Spark cluster.
>
> Regards
>
>
> On Sun, Apr 26, 2015 at 10:12 AM, ayan guha  wrote:
>
>> In my limited understanding, there must be a single "leader" master in
>> the cluster. If there are multiple leaders, it will lead to an unstable
>> cluster as each master will keep scheduling independently. You should use
>> zookeeper for HA, so that standby masters can vote to find new leader if
>> the primary goes down.
>>
>> Now, you can still have multiple masters running as leaders but
>> conceptually they should be thought as different clusters.
>>
>> Regarding workers, they should follow their master.
>>
>> Not sure if this answers your question, as I am sure you have read the
>> documentation thoroughly.
>>
>> Best
>> Ayan
>>
>> On Sun, Apr 26, 2015 at 6:31 PM, James King 
>> wrote:
>>
>>> If I have 5 nodes and I wish to maintain 1 Master and 2 Workers on each
>>> node, so in total I will have 5 master and 10 Workers.
>>>
>>> Now to maintain that setup I would like to query spark regarding the
>>> number Masters and Workers that are currently available using API calls and
>>> then take some appropriate action based on the information I get back, like
>>> restart a dead Master or Worker.
>>>
>>> Is this possible? does Spark provide such API?
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Yes, that is mostly why these third-party sites have sprung up around the
official archives--to provide better search. Did you try the link Ted
posted?

On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg 
wrote:

> It seems that those archives are not necessarily easy to find stuff in. Is
> there a search engine on top of them? so as to find e.g. your own posts
> easily?
>
> On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Sure, you can use Nabble or search-hadoop or whatever you prefer.
>>
>> My point is just that the source of truth are the Apache archives, and
>> these other sites may or may not be in sync with that truth.
>>
>> On Thu, Mar 19, 2015 at 10:20 AM Ted Yu  wrote:
>>
>>> I prefer using search-hadoop.com which provides better search
>>> capability.
>>>
>>> Cheers
>>>
>>> On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Nabble is a third-party site that tries its best to archive mail sent
>>>> out over the list. Nothing guarantees it will be in sync with the real
>>>> mailing list.
>>>>
>>>> To get the "truth" on what was sent over this, Apache-managed list, you
>>>> unfortunately need to go the Apache archives:
>>>> http://mail-archives.apache.org/mod_mbox/spark-user/
>>>>
>>>> Nick
>>>>
>>>> On Thu, Mar 19, 2015 at 5:18 AM Ted Yu  wrote:
>>>>
>>>>> There might be some delay:
>>>>>
>>>>>
>>>>> http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view
>>>>>
>>>>>
>>>>> On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg <
>>>>> dgoldenberg...@gmail.com> wrote:
>>>>>
>>>>> Thanks, Ted. Well, so far even there I'm only seeing my post and not,
>>>>> for example, your response.
>>>>>
>>>>> On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu  wrote:
>>>>>
>>>>>> Was this one of the threads you participated ?
>>>>>> http://search-hadoop.com/m/JW1q5w0p8x1
>>>>>>
>>>>>> You should be able to find your posts on search-hadoop.com
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg <
>>>>>> dgoldenberg...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry if this is a total noob question but is there a reason why I'm
>>>>>>> only
>>>>>>> seeing folks' responses to my posts in emails but not in the browser
>>>>>>> view
>>>>>>> under apache-spark-user-list.1001560.n3.nabble.com?  Is this a
>>>>>>> matter of
>>>>>>> setting your preferences such that your responses only go to email
>>>>>>> and never
>>>>>>> to the browser-based view of the list? I don't seem to see such a
>>>>>>> preference...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>


Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Sure, you can use Nabble or search-hadoop or whatever you prefer.

My point is just that the source of truth are the Apache archives, and
these other sites may or may not be in sync with that truth.

On Thu, Mar 19, 2015 at 10:20 AM Ted Yu  wrote:

> I prefer using search-hadoop.com which provides better search capability.
>
> Cheers
>
> On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Nabble is a third-party site that tries its best to archive mail sent out
>> over the list. Nothing guarantees it will be in sync with the real mailing
>> list.
>>
>> To get the "truth" on what was sent over this, Apache-managed list, you
>> unfortunately need to go the Apache archives:
>> http://mail-archives.apache.org/mod_mbox/spark-user/
>>
>> Nick
>>
>> On Thu, Mar 19, 2015 at 5:18 AM Ted Yu  wrote:
>>
>>> There might be some delay:
>>>
>>>
>>> http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view
>>>
>>>
>>> On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg 
>>> wrote:
>>>
>>> Thanks, Ted. Well, so far even there I'm only seeing my post and not,
>>> for example, your response.
>>>
>>> On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu  wrote:
>>>
>>>> Was this one of the threads you participated ?
>>>> http://search-hadoop.com/m/JW1q5w0p8x1
>>>>
>>>> You should be able to find your posts on search-hadoop.com
>>>>
>>>> On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg 
>>>> wrote:
>>>>
>>>>> Sorry if this is a total noob question but is there a reason why I'm
>>>>> only
>>>>> seeing folks' responses to my posts in emails but not in the browser
>>>>> view
>>>>> under apache-spark-user-list.1001560.n3.nabble.com?  Is this a matter
>>>>> of
>>>>> setting your preferences such that your responses only go to email and
>>>>> never
>>>>> to the browser-based view of the list? I don't seem to see such a
>>>>> preference...
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>


Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Nabble is a third-party site that tries its best to archive mail sent out
over the list. Nothing guarantees it will be in sync with the real mailing
list.

To get the "truth" on what was sent over this, Apache-managed list, you
unfortunately need to go the Apache archives:
http://mail-archives.apache.org/mod_mbox/spark-user/

Nick

On Thu, Mar 19, 2015 at 5:18 AM Ted Yu  wrote:

> There might be some delay:
>
>
> http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view
>
>
> On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg 
> wrote:
>
> Thanks, Ted. Well, so far even there I'm only seeing my post and not, for
> example, your response.
>
> On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu  wrote:
>
>> Was this one of the threads you participated ?
>> http://search-hadoop.com/m/JW1q5w0p8x1
>>
>> You should be able to find your posts on search-hadoop.com
>>
>> On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg 
>> wrote:
>>
>>> Sorry if this is a total noob question but is there a reason why I'm only
>>> seeing folks' responses to my posts in emails but not in the browser view
>>> under apache-spark-user-list.1001560.n3.nabble.com?  Is this a matter of
>>> setting your preferences such that your responses only go to email and
>>> never
>>> to the browser-based view of the list? I don't seem to see such a
>>> preference...
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer.
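
In case it's useful, here is a minimal, self-contained sketch of the idea. A
single gzip file is not splittable, so it loads as one partition;
repartitioning spreads the downstream work across the cluster. (The file name
and the multiplier of 3 are placeholders.)

from pyspark import SparkContext

sc = SparkContext(appName="gzip-repartition")   # placeholder

lines = sc.textFile("file.gz")
print(lines.getNumPartitions())                 # typically 1 for a single .gz file

lines = lines.repartition(sc.defaultParallelism * 3)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(5))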

Nick
​

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier  wrote:

> 1. I don't think textFile is capable of unpacking a .gz file. You need to
> use hadoopFile or newAPIHadoop file for this.
>
>
> Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do
> is compute splits on gz files, so if you have a single file, you'll have a
> single partition.
>
> Processing 30 GB of gzipped data should not take that long, at least with
> the Scala API. Python not sure, especially under 1.2.1.
>
>


Re: Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-23 Thread Nicholas Chammas
I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
but still, it should work…

Nope, it shouldn’t, unfortunately. The Spark base AMIs are custom-built for
spark-ec2. No other AMI will work unless it was built with that goal in
mind. Using a random AMI from the Amazon marketplace is unlikely to work
because there are several tools and packages (e.g. git) that need to
be on the AMI.

Furthermore, the spark-ec2 scripts all assume a yum-based Linux
distribution, so you won’t be able to use Ubuntu (and apt-get-based distro)
without some significant changes to the shell scripts used to build the AMI.

There is some work ongoing as part of SPARK-3821
 to make it easier to
generate AMIs that work with spark-ec2.

Nick
​

On Sun Feb 22 2015 at 7:42:52 PM Ted Yu  wrote:

> bq. bash: git: command not found
>
> Looks like the AMI doesn't have git pre-installed.
>
> Cheers
>
> On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh  wrote:
>
>> I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu)
>> using
>> the following:
>>
>> ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem'
>> --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2
>> --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch
>> spark-ubuntu-cluster
>>
>> Everything starts OK and instances are launched:
>>
>> Found 1 master(s), 2 slaves
>> Waiting for all instances in cluster to enter 'ssh-ready' state.
>> Generating cluster's SSH key on master.
>>
>> But then I'm getting the following SSH errors until it stops trying and
>> quits:
>>
>> bash: git: command not found
>> Connection to ***.us-west-2.compute.amazonaws.com closed.
>> Error executing remote command, retrying after 30 seconds: Command
>> '['ssh',
>> '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
>> 'UserKnownHostsFile=/dev/null', '-t', '-t',
>> u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git
>> clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero
>> exit
>> status 127
>>
>> I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
>> but still, it should work... Any advice would be greatly appreciated!
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Posting to the list

2015-02-23 Thread Nicholas Chammas
Nabble is a third-party site. If you send stuff through Nabble, Nabble has
to forward it along to the Apache mailing list. If something goes wrong
with that, you will have a message show up on Nabble that no-one saw.

The reverse can also happen, where something actually goes out on the list
and doesn't make it to Nabble.

Nabble is a nicer, third-party interface to the Apache list archives. No
more. It works best for reading through old threads.

Apache is the source of truth. Post through there.

Unfortunately, this is what we're stuck with. For a related discussion, see
this thread about Discourse.

Nick

On Sun Feb 22 2015 at 8:07:08 PM haihar nahak  wrote:

> I checked it but I didn't see any mail from user list. Let me do it one
> more time.
>
> --Harihar
>
> On Mon, Feb 23, 2015 at 11:50 AM, Ted Yu  wrote:
>
>> bq. i didnt get any new subscription mail in my inbox.
>>
>> Have you checked your Spam folder ?
>>
>> Cheers
>>
>> On Sun, Feb 22, 2015 at 2:36 PM, hnahak  wrote:
>>
>>> I'm also facing the same issue. This is the third time: whenever I post
>>> anything it is never accepted by the community, and at the same time I get
>>> a failure mail in my registered mail id.
>>>
>>> And when I click the "subscribe to this mailing list" link, I didn't get
>>> any new subscription mail in my inbox.
>>>
>>> Please can anyone suggest the best way to subscribe my email ID?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> {{{H2N}}}-(@:
>


Re: SQLContext.applySchema strictness

2015-02-14 Thread Nicholas Chammas
Would it make sense to add an optional validate parameter to applySchema()
which defaults to False, both to give users the option to check the schema
immediately and to make the default behavior clearer?
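
For illustration, here is a rough user-land sketch of what such opt-in
validation could look like -- this is not an existing Spark API, and the type
mapping and error message are simplified:

from pyspark import SparkContext
from pyspark.sql import SQLContext
try:
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
except ImportError:  # pre-1.3 layout keeps the type classes in pyspark.sql
    from pyspark.sql import StructType, StructField, StringType, IntegerType

def apply_schema_validated(sql_context, rdd, schema, validate=False):
    if validate:
        expected = {StringType: basestring, IntegerType: (int, long)}
        def check(row):
            for value, field in zip(row, schema.fields):
                ok = expected.get(type(field.dataType))
                if ok is not None and value is not None and not isinstance(value, ok):
                    raise TypeError("field %r: expected %s, got %r"
                                    % (field.name, field.dataType, value))
            return row
        rdd = rdd.map(check)  # still lazy: errors surface when the data is evaluated
    return sql_context.applySchema(rdd, schema)

sc = SparkContext(appName="applyschema-validate")   # placeholder
sqlContext = SQLContext(sc)
schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])
rows = sc.parallelize([("alice", 1), ("bob", 2)])
apply_schema_validated(sqlContext, rows, schema, validate=True).collect()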
​

On Sat Feb 14 2015 at 9:18:59 AM Michael Armbrust 
wrote:

> Doing runtime type checking is very expensive, so we only do it when
> necessary (i.e. you perform an operation like adding two columns together)
>
> On Sat, Feb 14, 2015 at 2:19 AM, nitin  wrote:
>
>> AFAIK, this is the expected behavior. You have to make sure that the
>> schema
>> matches the row. It won't give any error when you apply the schema as it
>> doesn't validate the nature of data.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-applySchema-strictness-tp21650p21653.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
OK, good luck!

On Mon Feb 09 2015 at 6:41:14 PM Guodong Wang  wrote:

> Hi Nicholas,
>
> Thanks for your quick reply.
>
> I'd like to try to build a image with create_image.sh. Then let's see how
> we can launch spark cluster in region cn-north-1.
>
>
>
> Guodong
>
> On Tue, Feb 10, 2015 at 3:59 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Guodong,
>>
>> spark-ec2 does not currently support the cn-north-1 region, but you can
>> follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to
>> find out when it does.
>>
>> The base AMI used to generate the current Spark AMIs is very old. I'm not
>> sure anyone knows what it is anymore. What I know is that it is an Amazon
>> Linux AMI.
>>
>> Yes, the create_image.sh script is what is used to generate the current
>> Spark AMI.
>>
>> Nick
>>
>> On Mon Feb 09 2015 at 3:27:13 AM Franc Carter <
>> franc.car...@rozettatech.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> I'm very new to Spark, but experienced with AWS - so take that into
>>> account with my suggestions.
>>>
>>> I started with an AWS base image and then added the pre-built Spark-1.2.
>>> I then made a 'Master' version and a 'Worker' version and then made
>>> AMIs for them.
>>>
>>> The Master comes up with a static IP and the Worker image has this baked
>>> in. I haven't completed everything I am planning to do but so far I can
>>> bring up the Master and a bunch of Workers inside an ASG and run Spark
>>> code successfully.
>>>
>>> cheers
>>>
>>>
>>> On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang 
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I want to launch spark cluster in AWS. And I know there is a
>>>> spark_ec2.py script.
>>>>
>>>> I am using the AWS service in China. But I can not find the AMI in the
>>>> region of China.
>>>>
>>>> So, I have to build one. My question is
>>>> 1. Where is the bootstrap script to create the Spark AMI? Is it here(
>>>> https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ?
>>>> 2. What is the base image of the Spark AMI? Eg, the base image of this (
>>>> https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm
>>>> )
>>>> 3. Shall I install scala during building the AMI?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> Guodong
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> *Franc Carter* | Systems Architect | Rozetta Technology
>>>
>>> franc.car...@rozettatech.com  |
>>> www.rozettatechnology.com
>>>
>>> Tel: +61 2 8355 2515
>>>
>>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>>
>>> PO Box H58, Australia Square, Sydney NSW 1215
>>>
>>> AUSTRALIA
>>>
>>>
>


Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
Guodong,

spark-ec2 does not currently support the cn-north-1 region, but you can
follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to
find out when it does.

The base AMI used to generate the current Spark AMIs is very old. I'm not
sure anyone knows what it is anymore. What I know is that it is an Amazon
Linux AMI.

Yes, the create_image.sh script is what is used to generate the current
Spark AMI.

Nick

On Mon Feb 09 2015 at 3:27:13 AM Franc Carter 
wrote:

>
> Hi,
>
> I'm very new to Spark, but experienced with AWS - so take that into
> account with my suggestions.
>
> I started with an AWS base image and then added the pre-built Spark-1.2. I
> then made a 'Master' version and a 'Worker' version and then made
> AMIs for them.
>
> The Master comes up with a static IP and the Worker image has this baked
> in. I haven't completed everything I am planning to do but so far I can
> bring up the Master and a bunch of Workers inside an ASG and run Spark
> code successfully.
>
> cheers
>
>
> On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang  wrote:
>
>> Hi guys,
>>
>> I want to launch spark cluster in AWS. And I know there is a spark_ec2.py
>> script.
>>
>> I am using the AWS service in China. But I can not find the AMI in the
>> region of China.
>>
>> So, I have to build one. My question is
>> 1. Where is the bootstrap script to create the Spark AMI? Is it here(
>> https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ?
>> 2. What is the base image of the Spark AMI? Eg, the base image of this (
>> https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm
>> )
>> 3. Shall I install scala during building the AMI?
>>
>>
>> Thanks.
>>
>> Guodong
>>
>
>
>
> --
>
> *Franc Carter* | Systems Architect | Rozetta Technology
>
> franc.car...@rozettatech.com  |
> www.rozettatechnology.com
>
> Tel: +61 2 8355 2515
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
> AUSTRALIA
>
>


Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Thanks for sending this over, Peter.

What if you try this? (i.e. Remove the = after --identity-file.)

ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file
~/.pzkeys/spark-streaming-kp.pem  --region=us-east-1 login
pz-spark-cluster

If that works, then I think the problem in this case is simply that Bash
cannot expand the tilde because it’s stuck to the --identity-file=. This
isn’t a problem with spark-ec2.

Bash sees the --identity-file=~/.pzkeys/spark-streaming-kp.pem as one big
argument, so it can’t do tilde expansion.
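
(For what it's worth, spark-ec2 could also sidestep this by expanding the
tilde itself; here's a tiny sketch of that idea, not necessarily what the
script does today:)

import os.path

def normalize_identity_file(path):
    # Expand ~ and ~user so the '=' form works no matter how the shell
    # tokenizes the argument.
    return os.path.expanduser(path)

print(normalize_identity_file("~/.pzkeys/spark-streaming-kp.pem"))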

Nick
​

On Wed Jan 28 2015 at 9:17:06 PM Peter Zybrick  wrote:

> Below is trace from trying to access with ~/path.  I also did the echo as
> per Nick (see the last line), looks ok to me.  This is my development box
> with Spark 1.2.0 running CentOS 6.5, Python 2.6.6
>
> [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2
> --key-pair=spark-streaming-kp
> --identity-file=~/.pzkeys/spark-streaming-kp.pem  --region=us-east-1 login
> pz-spark-cluster
> Searching for existing cluster pz-spark-cluster...
> Found 1 master(s), 3 slaves
> Logging into master ec2-54-152-95-129.compute-1.amazonaws.com...
> Warning: Identity file ~/.pzkeys/spark-streaming-kp.pem not accessible: No
> such file or directory.
> Permission denied (publickey).
> Traceback (most recent call last):
>   File "ec2/spark_ec2.py", line 1082, in 
> main()
>   File "ec2/spark_ec2.py", line 1074, in main
> real_main()
>   File "ec2/spark_ec2.py", line 1007, in real_main
> ssh_command(opts) + proxy_opt + ['-t', '-t', "%s@%s" % (opts.user,
> master)])
>   File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['ssh', '-o',
> 'StrictHostKeyChecking=no', '-i', '~/.pzkeys/spark-streaming-kp.pem', '-t',
> '-t', u'r...@ec2-54-152-95-129.compute-1.amazonaws.com']' returned
> non-zero exit status 255
> [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ echo
> ~/.pzkeys/spark-streaming-kp.pem
> /home/pete.zybrick/.pzkeys/spark-streaming-kp.pem
>
>
> On Wed, Jan 28, 2015 at 3:49 PM, Charles Feduke 
> wrote:
>
>> Yeah, I agree ~ should work. And it could have been [read: probably was]
>> the fact that one of the EC2 hosts was in my known_hosts (don't know, never
>> saw an error message, but the behavior is no error message for that state),
>> which I had fixed later with Pete's patch. But the second execution when
>> things worked with an absolute path could have worked because the random
>> hosts that came up on EC2 were never in my known_hosts.
>>
>>
>> On Wed Jan 28 2015 at 3:45:36 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Hmm, I can’t see why using ~ would be problematic, especially if you
>>> confirm that echo ~/path/to/pem expands to the correct path to your
>>> identity file.
>>>
>>> If you have a simple reproduction of the problem, please send it over.
>>> I’d love to look into this. When I pass paths with ~ to spark-ec2 on my
>>> system, it works fine. I’m using bash, but zsh handles tilde expansion the
>>> same as bash.
>>>
>>> Nick
>>> ​
>>>
>>> On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke <
>>> charles.fed...@gmail.com> wrote:
>>>
>>>> It was only hanging when I specified the path with ~ I never tried
>>>> relative.
>>>>
>>>> Hanging on the waiting for ssh to be ready on all hosts. I let it sit
>>>> for about 10 minutes then I found the StackOverflow answer that suggested
>>>> specifying an absolute path, cancelled, and re-run with --resume and the
>>>> absolute path and all slaves were up in a couple minutes.
>>>>
>>>> (I've stood up 4 integration clusters and 2 production clusters on EC2
>>>> since with no problems.)
>>>>
>>>> On Wed Jan 28 2015 at 12:05:43 PM Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Ey-chih,
>>>>>
>>>>> That makes more sense. This is a known issue that will be fixed as
>>>>> part of SPARK-5242 <https://issues.apache.org/jira/browse/SPARK-5242>.
>>>>>
>>>>> Charles,
>>>>>
>>>>> Thanks for the info. In your case, when does spark-ec2 hang? Only when
>>>>> the specified path to the identity file doesn't exist? Or also when you
>>>>> 

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
If that was indeed the problem, I suggest updating your answer on SO
<http://stackoverflow.com/a/28005151/877069> to help others who may run
into this same problem.
​

On Wed Jan 28 2015 at 9:40:39 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Thanks for sending this over, Peter.
>
> What if you try this? (i.e. Remove the = after --identity-file.)
>
> ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file 
> ~/.pzkeys/spark-streaming-kp.pem  --region=us-east-1 login pz-spark-cluster
>
> If that works, then I think the problem in this case is simply that Bash
> cannot expand the tilde because it’s stuck to the --identity-file=. This
> isn’t a problem with spark-ec2.
>
> Bash sees the --identity-file=~/.pzkeys/spark-streaming-kp.pem as one big
> argument, so it can’t do tilde expansion.
>
> Nick
> ​
>
> On Wed Jan 28 2015 at 9:17:06 PM Peter Zybrick  wrote:
>
>> Below is trace from trying to access with ~/path.  I also did the echo as
>> per Nick (see the last line), looks ok to me.  This is my development box
>> with Spark 1.2.0 running CentOS 6.5, Python 2.6.6
>>
>> [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2
>> --key-pair=spark-streaming-kp 
>> --identity-file=~/.pzkeys/spark-streaming-kp.pem
>> --region=us-east-1 login pz-spark-cluster
>> Searching for existing cluster pz-spark-cluster...
>> Found 1 master(s), 3 slaves
>> Logging into master ec2-54-152-95-129.compute-1.amazonaws.com...
>> Warning: Identity file ~/.pzkeys/spark-streaming-kp.pem not accessible:
>> No such file or directory.
>> Permission denied (publickey).
>> Traceback (most recent call last):
>>   File "ec2/spark_ec2.py", line 1082, in 
>> main()
>>   File "ec2/spark_ec2.py", line 1074, in main
>> real_main()
>>   File "ec2/spark_ec2.py", line 1007, in real_main
>> ssh_command(opts) + proxy_opt + ['-t', '-t', "%s@%s" % (opts.user,
>> master)])
>>   File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
>> raise CalledProcessError(retcode, cmd)
>> subprocess.CalledProcessError: Command '['ssh', '-o',
>> 'StrictHostKeyChecking=no', '-i', '~/.pzkeys/spark-streaming-kp.pem',
>> '-t', '-t', u'r...@ec2-54-152-95-129.compute-1.amazonaws.com']' returned
>> non-zero exit status 255
>> [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ echo ~/.pzkeys/spark-streaming-kp.
>> pem
>> /home/pete.zybrick/.pzkeys/spark-streaming-kp.pem
>>
>>
>> On Wed, Jan 28, 2015 at 3:49 PM, Charles Feduke > > wrote:
>>
>>> Yeah, I agree ~ should work. And it could have been [read: probably was]
>>> the fact that one of the EC2 hosts was in my known_hosts (don't know, never
>>> saw an error message, but the behavior is no error message for that state),
>>> which I had fixed later with Pete's patch. But the second execution when
>>> things worked with an absolute path could have worked because the random
>>> hosts that came up on EC2 were never in my known_hosts.
>>>
>>>
>>> On Wed Jan 28 2015 at 3:45:36 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Hmm, I can’t see why using ~ would be problematic, especially if you
>>>> confirm that echo ~/path/to/pem expands to the correct path to your
>>>> identity file.
>>>>
>>>> If you have a simple reproduction of the problem, please send it over.
>>>> I’d love to look into this. When I pass paths with ~ to spark-ec2 on my
>>>> system, it works fine. I’m using bash, but zsh handles tilde expansion the
>>>> same as bash.
>>>>
>>>> Nick
>>>> ​
>>>>
>>>> On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke <
>>>> charles.fed...@gmail.com> wrote:
>>>>
>>>>> It was only hanging when I specified the path with ~ I never tried
>>>>> relative.
>>>>>
>>>>> Hanging on the waiting for ssh to be ready on all hosts. I let it sit
>>>>> for about 10 minutes then I found the StackOverflow answer that suggested
>>>>> specifying an absolute path, cancelled, and re-run with --resume and the
>>>>> absolute path and all slaves were up in a couple minutes.
>>>>>
>>>>> (I've stood up 4 integration clusters and 2 production clusters on EC2
>>>>> since with no problems.)
>>>>>
>>>>

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Hmm, I can’t see why using ~ would be problematic, especially if you
confirm that echo ~/path/to/pem expands to the correct path to your
identity file.

If you have a simple reproduction of the problem, please send it over. I’d
love to look into this. When I pass paths with ~ to spark-ec2 on my system,
it works fine. I’m using bash, but zsh handles tilde expansion the same as
bash.

Nick
​

On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke 
wrote:

> It was only hanging when I specified the path with ~ I never tried
> relative.
>
> Hanging on the waiting for ssh to be ready on all hosts. I let it sit for
> about 10 minutes then I found the StackOverflow answer that suggested
> specifying an absolute path, cancelled, and re-run with --resume and the
> absolute path and all slaves were up in a couple minutes.
>
> (I've stood up 4 integration clusters and 2 production clusters on EC2
> since with no problems.)
>
> On Wed Jan 28 2015 at 12:05:43 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Ey-chih,
>>
>> That makes more sense. This is a known issue that will be fixed as part
>> of SPARK-5242 <https://issues.apache.org/jira/browse/SPARK-5242>.
>>
>> Charles,
>>
>> Thanks for the info. In your case, when does spark-ec2 hang? Only when
>> the specified path to the identity file doesn't exist? Or also when you
>> specify the path as a relative path or with ~?
>>
>> Nick
>>
>>
>> On Wed Jan 28 2015 at 9:29:34 AM ey-chih chow  wrote:
>>
>>> We found the problem and already fixed it.  Basically, spark-ec2
>>> requires ec2 instances to have external ip addresses. You need to specify
>>> this in the AWS console.
>>> --
>>> From: nicholas.cham...@gmail.com
>>> Date: Tue, 27 Jan 2015 17:19:21 +
>>> Subject: Re: spark 1.2 ec2 launch script hang
>>> To: charles.fed...@gmail.com; pzybr...@gmail.com; eyc...@hotmail.com
>>> CC: user@spark.apache.org
>>>
>>>
>>> For those who found that absolute vs. relative path for the pem file
>>> mattered, what OS and shell are you using? What version of Spark are you
>>> using?
>>>
>>> ~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to
>>> the absolute path before sending it to spark-ec2. (i.e. tilde
>>> expansion.)
>>>
>>> Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t
>>> matter, since we fixed that for Spark 1.2.0
>>> <https://issues.apache.org/jira/browse/SPARK-4137>. Maybe there’s some
>>> case that we missed?
>>>
>>> Nick
>>>
>>> On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke <
>>> charles.fed...@gmail.com> wrote:
>>>
>>>
>>> Absolute path means no ~ and also verify that you have the path to the
>>> file correct. For some reason the Python code does not validate that the
>>> file exists and will hang (this is the same reason why ~ hangs).
>>> On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick 
>>> wrote:
>>>
>>> Try using an absolute path to the pem file
>>>
>>>
>>>
>>> > On Jan 26, 2015, at 8:57 PM, ey-chih chow  wrote:
>>> >
>>> > Hi,
>>> >
>>> > I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
>>> > modified the script according to
>>> >
>>> > https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab
>>> 9753aae939b3bb33be953e2c13a70
>>> >
>>> > But the script was still hung at the following message:
>>> >
>>> > Waiting for cluster to enter 'ssh-ready'
>>> > state.
>>> >
>>> > Any additional thing I should do to make it succeed?  Thanks.
>>> >
>>> >
>>> > Ey-Chih Chow
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>> ​
>>>
>>


Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Ey-chih,

That makes more sense. This is a known issue that will be fixed as part of
SPARK-5242 .

Charles,

Thanks for the info. In your case, when does spark-ec2 hang? Only when the
specified path to the identity file doesn't exist? Or also when you specify
the path as a relative path or with ~?

Nick


On Wed Jan 28 2015 at 9:29:34 AM ey-chih chow  wrote:

> We found the problem and already fixed it.  Basically, spark-ec2 requires
> ec2 instances to have external ip addresses. You need to specify this in
> the AWS console.
> --
> From: nicholas.cham...@gmail.com
> Date: Tue, 27 Jan 2015 17:19:21 +
> Subject: Re: spark 1.2 ec2 launch script hang
> To: charles.fed...@gmail.com; pzybr...@gmail.com; eyc...@hotmail.com
> CC: user@spark.apache.org
>
>
> For those who found that absolute vs. relative path for the pem file
> mattered, what OS and shell are you using? What version of Spark are you
> using?
>
> ~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to
> the absolute path before sending it to spark-ec2. (i.e. tilde expansion.)
>
> Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t
> matter, since we fixed that for Spark 1.2.0
> <https://issues.apache.org/jira/browse/SPARK-4137>. Maybe there’s some
> case that we missed?
>
> Nick
>
> On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke 
> wrote:
>
>
> Absolute path means no ~ and also verify that you have the path to the
> file correct. For some reason the Python code does not validate that the
> file exists and will hang (this is the same reason why ~ hangs).
> On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick  wrote:
>
> Try using an absolute path to the pem file
>
>
>
> > On Jan 26, 2015, at 8:57 PM, ey-chih chow  wrote:
> >
> > Hi,
> >
> > I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
> > modified the script according to
> >
> > https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab
> 9753aae939b3bb33be953e2c13a70
> >
> > But the script was still hung at the following message:
> >
> > Waiting for cluster to enter 'ssh-ready'
> > state.
> >
> > Any additional thing I should do to make it succeed?  Thanks.
> >
> >
> > Ey-Chih Chow
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
> ​
>


Re: spark 1.2 ec2 launch script hang

2015-01-27 Thread Nicholas Chammas
For those who found that absolute vs. relative path for the pem file
mattered, what OS and shell are you using? What version of Spark are you
using?

~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to the
absolute path before sending it to spark-ec2. (i.e. tilde expansion.)

Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t matter,
since we fixed that for Spark 1.2.0
<https://issues.apache.org/jira/browse/SPARK-4137>. Maybe there’s some case
that we missed?

Nick
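
As Charles points out in the quoted reply below, spark-ec2 does not check that
the identity file actually exists before trying to use it, so a bad -i path can
look like a hang. A minimal pre-flight check along these lines (just a sketch,
not something spark-ec2 runs itself; the path is a placeholder) can rule that
out:

    import os
    import sys

    pem = "~/aws/my-key.pem"            # placeholder; whatever you pass to -i

    resolved = os.path.expanduser(pem)  # same tilde expansion your shell does
    if not os.path.isfile(resolved):
        sys.exit("Identity file not found: {}".format(resolved))
    print("Identity file OK: {}".format(os.path.abspath(resolved)))

If that check passes and the launch still hangs, the problem is likely
elsewhere (for example, instances launched without external IP addresses, as
mentioned elsewhere in this thread).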

On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke 
wrote:

Absolute path means no ~ and also verify that you have the path to the file
> correct. For some reason the Python code does not validate that the file
> exists and will hang (this is the same reason why ~ hangs).
> On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick  wrote:
>
>> Try using an absolute path to the pem file
>>
>>
>>
>> > On Jan 26, 2015, at 8:57 PM, ey-chih chow  wrote:
>> >
>> > Hi,
>> >
>> > I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
>> > modified the script according to
>> >
>> > https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab
>> 9753aae939b3bb33be953e2c13a70
>> >
>> > But the script was still hung at the following message:
>> >
>> > Waiting for cluster to enter 'ssh-ready'
>> > state.
>> >
>> > Any additional thing I should do to make it succeed?  Thanks.
>> >
>> >
>> > Ey-Chih Chow
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>  ​


Re: saving rdd to multiple files named by the key

2015-01-27 Thread Nicholas Chammas
There is also SPARK-3533 <https://issues.apache.org/jira/browse/SPARK-3533>,
which proposes to add a convenience method for this.
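
Until something like that lands, one workaround is to write one output
directory per key yourself. A rough PySpark sketch (it launches one Spark job
per key, so it only makes sense for a modest number of distinct keys; the data
and output paths are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="save-by-key")

    # Placeholder data: (key, value) pairs already combined by key.
    pairs = sc.parallelize([("us", "a"), ("us", "b"), ("eu", "c")]).cache()

    for key in pairs.keys().distinct().collect():
        (pairs.filter(lambda kv, k=key: kv[0] == k)    # bind key per iteration
              .values()
              .saveAsTextFile("output/{}".format(key)))  # output/us, output/eu

The Stack Overflow answer linked below avoids the per-key jobs by subclassing
Hadoop's MultipleTextOutputFormat, at the cost of dropping down to the
Scala/Java API.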
​

On Mon Jan 26 2015 at 10:38:56 PM Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> This might be helpful:
> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
>
> On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport  wrote:
>
>> Hi,
>>
>> I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
>> I got them by combining many [k,v] by [k]. I could then save to file by
>> partitions, but that still doesn't allow me to choose the name, and leaves
>> me stuck with foo/part-...
>>
>> Any tips?
>>
>> Thanks,
>> Sharon
>>
>


Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-24 Thread Nicholas Chammas
I believe databricks provides an rdd interface to redshift. Did you check
spark-packages.org?
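
If that package doesn't fit, another option that avoids funneling everything
through the driver (which is what parallelize over a generator ends up doing)
is the JDBC data source added in later Spark releases, which reads in parallel
when you give it a numeric partitioning column. A sketch using the 1.4+ Python
API (the URL, credentials, table, and bounds are placeholders, and the
Redshift/PostgreSQL JDBC driver has to be on the classpath):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

    df = (sqlContext.read.format("jdbc")
          .options(
              url="jdbc:postgresql://my-cluster.example.com:5439/mydb",
              user="analyst", password="secret",        # placeholders
              dbtable="(select * from events where ds = '2015-01-24') t",
              partitionColumn="event_id",               # numeric column
              lowerBound="1", upperBound="1000000", numPartitions="16")
          .load())

    df.groupBy("event_type").count().show()

Each of the 16 partitions issues its own range query against the database, so
the rows never have to pass through a single Python process.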
On Sat, Jan 24, 2015 at 6:45 AM Denis Mikhalkin 
wrote:

> Hello,
>
> we've got some analytics data in AWS Redshift. The data is being
> constantly updated.
>
> I'd like to be able to write a query against Redshift which would return a
> subset of data, and then run a Spark job (Pyspark) to do some analysis.
>
> I could not find an RDD which would let me do it OOB (Python), so I tried
> writing my own. For example, tried combination of a generator (via yield)
> with parallelize. It appears though that "parallelize" reads all the data
> first into memory as I get either OOM or Python swaps as soon as I increase
> the number of rows beyond trivial limits.
>
> I've also looked at Java RDDs (there is an example of MySQL RDD) but it
> seems that it also reads all the data into memory.
>
> So my question is - how to correctly feed Spark with huge datasets which
> don't initially reside in HDFS/S3 (ideally for Pyspark, but would
> appreciate any tips)?
>
> Thanks.
>
> Denis
>
>
>


Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-5390

On Fri Jan 23 2015 at 12:05:00 PM Gerard Maas  wrote:

> +1
>
> On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> That sounds good to me. Shall I open a JIRA / PR about updating the site
>> community page?
>> On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell 
>> wrote:
>>
>>> Hey Nick,
>>>
>>> So I think what we can do is encourage people to participate on the
>>> stack overflow topic, and this I think we can do on the Spark website
>>> as a first class community resource for Spark. We should probably be
>>> spending more time on that site given its popularity.
>>>
>>> In terms of encouraging this explicitly *to replace* the ASF mailing
>>> list, that I think is harder to do. The ASF makes a lot of effort to
>>> host its own infrastructure that is neutral and not associated with
>>> any corporation. And by and large the ASF policy is to consider that
>>> as the de-facto forum of communication for any project.
>>>
>>> Personally, I wish the ASF would update this policy - for instance, by
>>> allowing the use of third party lists or communication fora - provided
>>> that they allow exporting the conversation if those sites were to
>>> change course. However, the state of the art stands as such.
>>>
>>> - Patrick
>>>
>>>
>>> On Wed, Jan 21, 2015 at 8:43 AM, Nicholas Chammas
>>>  wrote:
>>> > Josh / Patrick,
>>> >
>>> > What do y’all think of the idea of promoting Stack Overflow as a place
>>> to
>>> > ask questions over this list, as long as the questions fit SO’s
>>> guidelines
>>> > (how-to-ask, dont-ask)?
>>> >
>>> > The apache-spark tag is very active on there.
>>> >
>>> > Discussions of all types are still on-topic here, but when possible we
>>> want
>>> > to encourage people to use SO.
>>> >
>>> > Nick
>>> >
>>> > On Wed Jan 21 2015 at 8:37:05 AM Jay Vyas jayunit100.apa...@gmail.com
>>> wrote:
>>> >>
>>> >> Its a very valid  idea indeed, but... It's a tricky  subject since the
>>> >> entire ASF is run on mailing lists , hence there are so many
>>> different but
>>> >> equally sound ways of looking at this idea, which conflict with one
>>> another.
>>> >>
>>> >> > On Jan 21, 2015, at 7:03 AM, btiernay  wrote:
>>> >> >
>>> >> > I think this is a really great idea for really opening up the
>>> >> > discussions
>>> >> > that happen here. Also, it would be nice to know why there doesn't
>>> seem
>>> >> > to
>>> >> > be much interest. Maybe I'm misunderstanding some nuance of Apache
>>> >> > projects.
>>> >> >
>>> >> > Cheers
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > View this message in context:
>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/
>>> Discourse-A-proposed-alternative-to-the-Spark-User-
>>> list-tp20851p21288.html
>>> >> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >> >
>>> >> > 
>>> -
>>> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >
>>>
>>
>


Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Nicholas Chammas
That sounds good to me. Shall I open a JIRA / PR about updating the site
community page?
On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell 
wrote:

> Hey Nick,
>
> So I think what we can do is encourage people to participate on the
> stack overflow topic, and this I think we can do on the Spark website
> as a first class community resource for Spark. We should probably be
> spending more time on that site given its popularity.
>
> In terms of encouraging this explicitly *to replace* the ASF mailing
> list, that I think is harder to do. The ASF makes a lot of effort to
> host its own infrastructure that is neutral and not associated with
> any corporation. And by and large the ASF policy is to consider that
> as the de-facto forum of communication for any project.
>
> Personally, I wish the ASF would update this policy - for instance, by
> allowing the use of third party lists or communication fora - provided
> that they allow exporting the conversation if those sites were to
> change course. However, the state of the art stands as such.
>
> - Patrick
>
> On Wed, Jan 21, 2015 at 8:43 AM, Nicholas Chammas
>  wrote:
> > Josh / Patrick,
> >
> > What do y’all think of the idea of promoting Stack Overflow as a place to
> > ask questions over this list, as long as the questions fit SO’s
> guidelines
> > (how-to-ask, dont-ask)?
> >
> > The apache-spark tag is very active on there.
> >
> > Discussions of all types are still on-topic here, but when possible we
> want
> > to encourage people to use SO.
> >
> > Nick
> >
> > On Wed Jan 21 2015 at 8:37:05 AM Jay Vyas jayunit100.apa...@gmail.com
> wrote:
> >>
> >> Its a very valid  idea indeed, but... It's a tricky  subject since the
> >> entire ASF is run on mailing lists , hence there are so many different
> but
> >> equally sound ways of looking at this idea, which conflict with one
> another.
> >>
> >> > On Jan 21, 2015, at 7:03 AM, btiernay  wrote:
> >> >
> >> > I think this is a really great idea for really opening up the
> >> > discussions
> >> > that happen here. Also, it would be nice to know why there doesn't
> seem
> >> > to
> >> > be much interest. Maybe I'm misunderstanding some nuance of Apache
> >> > projects.
> >> >
> >> > Cheers
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/
> Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851p21288.html
> >> > Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >> >
> >> > -
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >> >
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
I agree with Sean that a Spark-specific Stack Exchange likely won't help
and almost certainly won't make it out of Area 51. The idea certainly
sounds nice from our perspective as Spark users, but it doesn't mesh with
the structure of Stack Exchange or the criteria for creating new sites.

On Thu Jan 22 2015 at 1:23:14 PM Sean Owen  wrote:

> FWIW I am a moderator for datascience.stackexchange.com, and even that
> hasn't really achieved the critical mass that SE sites are supposed
> to: http://area51.stackexchange.com/proposals/55053/data-science
>
> I think a Spark site would have a lot less traffic. One annoyance is
> that people can't figure out when to post on SO vs Data Science vs
> Cross Validated. A Spark site would have the same problem,
> fragmentation and cross posting with SO. I don't think this would be
> accepted as a StackExchange site and don't think it helps.
>
> On Thu, Jan 22, 2015 at 6:16 PM, pierred  wrote:
> >
> > A dedicated stackexchange site for Apache Spark sounds to me like the
> > logical solution.  Less trolling, more enthusiasm, and with the
> > participation of the people on this list, I think it would very quickly
> > become the reference for many technical questions, as well as a great
> > vehicle to promote the awesomeness of Spark.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
we could implement some ‘load balancing’ policies:

I think Gerard’s suggestions are good. We need some “official” buy-in from
the project’s maintainers and heavy contributors and we should move forward
with them.

I know that at least Josh Rosen, Sean Owen, and Tathagata Das, who are
active on this list, are also active on SO
<http://stackoverflow.com/tags/apache-spark/topusers>. So perhaps we’re
already part of the way there.

Nick
​

On Thu Jan 22 2015 at 5:32:40 AM Gerard Maas  wrote:

> I have been contributing to SO for a while now.  Here are a few
> observations I'd like to contribute to the discussion:
>
> The level of questions on SO is often more entry-level. "Harder"
> questions (that require expertise in a certain area) remain unanswered for
> a while. Same questions here on the list (as they are often cross-posted)
> receive faster turnaround.
> Roughly speaking, there're two groups of questions: Implementing things on
> Spark and Running Spark.  The second one is borderline on SO guidelines as
> they often involve cluster setups, long logs and little idea of what's
> going on (mind you, often those questions come from people starting with
> Spark)
>
> In my opinion, Stack Overflow offers a better Q/A experience, in
> particular, they have tooling in place to reduce duplicates, something that
> often overloads this list (same "getting started issues" or "how to map,
> filter, flatmap" over and over again).  That said, this list offers a
> richer forum, where the expertise pool is a lot deeper.
> Also, while SO is fairly strict in requiring posters from showing a
> minimal amount of effort in the question being asked, this list is quite
> friendly to the same behavior. This could be probably an element that makes
> the list 'lower impedance'.
> One additional thing on SO is that the [apache-spark] tag is a 'low rep'
> tag. Neither questions nor answers get significant voting, reducing the
> 'rep gaming' factor  (discouraging participation?)
>
> Thinking about how to improve both platforms: SO[apache-spark] and this
> ML, and get the list back to "not overwhelming" message volumes, we could
> implement some 'load balancing' policies:
> - encourage new users to use Stack Overflow, in particular, redirect
> newbie questions to SO the friendly way: "did you search SO already?" or
> link to an existing question.
>   - most how to "map, flatmap, filter, aggregate, reduce, ..." would fall
> under  this category
> - encourage domain experts to hang on SO more often  (my impression is
> that MLLib, GraphX are fairly underserved)
> - have an 'escalation' process in place, where we could post
> 'interesting/hard/bug' questions from SO back to the list (or encourage the
> poster to do so)
> - update our "community guidelines" on [
> http://spark.apache.org/community.html] to implement such policies.
>
> Those are just some ideas on how to improve the community and better serve
> the newcomers while avoiding overload of our existing expertise pool.
>
> kr, Gerard.
>
>
> On Thu, Jan 22, 2015 at 10:42 AM, Sean Owen  wrote:
>
>> Yes, there is some project business like votes of record on releases that
>> needs to be carried on in standard, simple accessible place and SO is not
>> at all suitable.
>>
>> Nobody is stuck with Nabble. The suggestion is to enable a different
>> overlay on the existing list. SO remains a place you can ask questions too.
>> So I agree with Nick's take.
>>
>> BTW are there perhaps plans to split this mailing list into
>> subproject-specific lists? That might also help tune in/out the subset of
>> conversations of interest.
>> On Jan 22, 2015 10:30 AM, "Petar Zecevic" 
>> wrote:
>>
>>>
>>> Ok, thanks for the clarifications. I didn't know this list has to remain
>>> as the only official list.
>>>
>>> Nabble is really not the best solution in the world, but we're stuck
>>> with it, I guess.
>>>
>>> That's it from me on this subject.
>>>
>>> Petar
>>>
>>>
>>> On 22.1.2015. 3:55, Nicholas Chammas wrote:
>>>
>>>  I think a few things need to be laid out clearly:
>>>
>>>1. This mailing list is the “official” user discussion platform.
>>>That is, it is sponsored and managed by the ASF.
>>>2. Users are free to organize independent discussion platforms
>>>focusing on Spark, and there is already one such platform in Stack 
>>> Overflow
>>>under the apache-spark 

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
I think a few things need to be laid out clearly:

   1. This mailing list is the “official” user discussion platform. That
   is, it is sponsored and managed by the ASF.
   2. Users are free to organize independent discussion platforms focusing
   on Spark, and there is already one such platform in Stack Overflow under
   the apache-spark and related tags. Stack Overflow works quite well.
   3. The ASF will not agree to deprecating or migrating this user list to
   a platform that they do not control.
   4. This mailing list has grown to an unwieldy size and discussions are
   hard to find or follow; discussion tooling is also lacking. We want to
   improve the utility and user experience of this mailing list.
   5. We don’t want to fragment this “official” discussion community.
   6. Nabble is an independent product not affiliated with the ASF. It
   offers a slightly better interface to the Apache mailing list archives.

So to respond to some of your points, pzecevic:

Apache user group could be frozen (not accepting new questions, if that’s
possible) and redirect users to Stack Overflow (automatic reply?).

From what I understand of the ASF’s policies, this is not possible. :( This
mailing list must remain the official Spark user discussion platform.

Other thing, about new Stack Exchange site I proposed earlier. If a new
site is created, there is no problem with guidelines, I think, because
Spark community can apply different guidelines for the new site.

I think Stack Overflow and the various Spark tags are working fine. I don’t
see a compelling need for a Stack Exchange dedicated to Spark, either now
or in the near future. Also, I doubt a Spark-specific site can pass the 4
tests in the Area 51 FAQ <http://area51.stackexchange.com/faq>:

   - Almost all Spark questions are on-topic for Stack Overflow
   - Stack Overflow already exists, it already has a tag for Spark, and
   nobody is complaining
   - You’re not creating such a big group that you don’t have enough
   experts to answer all possible questions
   - There’s a high probability that users of Stack Overflow would enjoy
   seeing the occasional question about Spark

I think complaining won’t be sufficient. :)

Someone expressed a concern that they won’t allow creating a
project-specific site, but there already exist some project-specific sites,
like Tor, Drupal, Ubuntu…

The communities for these projects are many, many times larger than the
Spark community is or likely ever will be, simply due to the nature of the
problems they are solving.

What we need is an improvement to this mailing list. We need better tooling
than Nabble to sit on top of the Apache archives, and we also need some way
to control the volume and quality of mail on the list so that it remains a
useful resource for the majority of users.

Nick
​

On Wed Jan 21 2015 at 3:13:21 PM pzecevic  wrote:

> Hi,
> I tried to find the last reply by Nick Chammas (that I received in the
> digest) using the Nabble web interface, but I cannot find it (perhaps he
> didn't reply directly to the user list?). That's one example of Nabble's
> usability.
>
> Anyhow, I wanted to add my two cents...
>
> Apache user group could be frozen (not accepting new questions, if that's
> possible) and redirect users to Stack Overflow (automatic reply?). Old
> questions remain (and are searchable) on Nabble, new questions go to Stack
> Exchange, so no need for migration. That's the idea, at least, as I'm not
> sure if that's technically doable... Is it?
> dev mailing list could perhaps stay on Nabble (it's not that busy), or have
> a special tag on Stack Exchange.
>
> Other thing, about new Stack Exchange site I proposed earlier. If a new
> site
> is created, there is no problem with guidelines, I think, because Spark
> community can apply different guidelines for the new site.
>
> There is a FAQ about creating new sites: http://area51.stackexchange.
> com/faq
> It says: "Stack Exchange sites are free to create and free to use. All we
> ask is that you have an enthusiastic, committed group of expert users who
> check in regularly, asking and answering questions."
> I think this requirement is satisfied...
> Someone expressed a concern that they won't allow creating a
> project-specific site, but there already exist some project-specific sites,
> like Tor, Drupal, Ubuntu...
>
> Later, though, the FAQ also says:
> "If Y already exists, it already has a tag for X, and nobody is
> complaining"
> (then you should not create a new site). But we could complain :)
>
> The advantage of having a separate site is that users, who should have more
> privileges, would need to earn them through Spark questions and answers
> only. The other thing, already mentioned, is that the community could
> create
> Spark specific guidelines. There are also  'meta' sites for asking
> questions
> like this one, etc.
>
> There is a process for starting a site - it's not instantaneous. New site
> needs to go through private beta and pu

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
Josh / Patrick,

What do y’all think of the idea of promoting Stack Overflow as a place to
ask questions over this list, as long as the questions fit SO’s guidelines
(how-to-ask, dont-ask)?

The apache-spark 
tag is very active on there.

Discussions of all types are still on-topic here, but when possible we want
to encourage people to use SO.

Nick

On Wed Jan 21 2015 at 8:37:05 AM Jay Vyas jayunit100.apa...@gmail.com
 wrote:

Its a very valid  idea indeed, but... It's a tricky  subject since the
> entire ASF is run on mailing lists , hence there are so many different but
> equally sound ways of looking at this idea, which conflict with one another.
>
> > On Jan 21, 2015, at 7:03 AM, btiernay  wrote:
> >
> > I think this is a really great idea for really opening up the discussions
> > that happen here. Also, it would be nice to know why there doesn't seem
> to
> > be much interest. Maybe I'm misunderstanding some nuance of Apache
> projects.
> >
> > Cheers
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-
> list-tp20851p21288.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>  ​


Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-20 Thread Nicholas Chammas
Are the gz files roughly equal in size? Do you know that your partitions
are roughly balanced? Perhaps some cores get assigned tasks that end very
quickly, while others get most of the work.
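
One thing worth double-checking (it restates the advice quoted below): gzip is
not a splittable format, so sc.textFile creates exactly one partition per .gz
file, and each file is decompressed by a single task. A quick way to inspect
and rebalance (the path is a placeholder):

    rdd = sc.textFile("s3n://my-bucket/logs/*.gz")   # placeholder path

    print(rdd.getNumPartitions())   # number of gz files, not number of cores

    # Shuffle the decompressed records across all cores before the
    # CPU-heavy processing steps.
    balanced = rdd.repartition(sc.defaultParallelism * 3)
    print(balanced.count())         # count() forces the pipeline to run

Note that the decompression itself still runs as one task per file;
repartitioning only spreads out the work that comes after the read.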

On Sat Jan 17 2015 at 2:02:49 AM Gautham Anil 
wrote:

> Hi,
>
> Thanks for getting back to me. Sorry for the delay. I am still having
> this issue.
>
> @sun: To clarify, the machine actually has 16 usable threads and the
> job has more than 100 gzip files. So, there are enough partitions to
> use all threads.
>
> @nicholas: The number of partitions match the number of files: > 100.
>
> @Sebastian: I understand the lazy loading behavior. For this reason, I
> usually use a .count() to force the transformation (.first() will not
> be enough). Still, during the transformation, only 4 cores are used
> for processing the input files.
>
> I don't know if this issue is noticed by other people. Can anyone
> reproduce it with v1.1?
>
>
> On Wed, Dec 17, 2014 at 2:14 AM, Nicholas Chammas
>  wrote:
> > Rui is correct.
> >
> > Check how many partitions your RDD has after loading the gzipped files.
> e.g.
> > rdd.getNumPartitions().
> >
> > If that number is way less than the number of cores in your cluster (in
> your
> > case I suspect the number is 4), then explicitly repartition the RDD to
> > match the number of cores in your cluster, or some multiple thereof.
> >
> > For example:
> >
> > new_rdd = rdd.repartition(sc.defaultParallelism * 3)
> >
> > Operations on new_rdd should utilize all the cores in your cluster.
> >
> > Nick
> >
> >
> > On Wed Dec 17 2014 at 1:42:16 AM Sun, Rui  wrote:
> >>
> >> Gautham,
> >>
> >> How many gz files do you have?  Maybe the reason is that a gz file is
> >> compressed and can't be split for processing by MapReduce. A single gz
> >> file can only be processed by a single mapper, so the CPU threads can't
> >> be fully utilized.
> >>
> >> -Original Message-
> >> From: Gautham [mailto:gautham.a...@gmail.com]
> >> Sent: Wednesday, December 10, 2014 3:00 AM
> >> To: u...@spark.incubator.apache.org
> >> Subject: pyspark sc.textFile uses only 4 out of 32 threads per node
> >>
> >> I am having an issue with pyspark launched in ec2 (using spark-ec2)
> with 5
> >> r3.4xlarge machines where each has 32 threads and 240GB of RAM. When I
> do
> >> sc.textFile to load data from a number of gz files, it does not
> progress as
> >> fast as expected. When I log-in to a child node and run top, I see only
> 4
> >> threads at 100 cpu. All remaining 28 cores were idle. This is not an
> issue
> >> when processing the strings after loading, when all the cores are used
> to
> >> process the data.
> >>
> >> Please help me with this? What setting can be changed to get the CPU
> usage
> >> back up to full?
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-
> sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
> additional
> >> commands, e-mail: user-h...@spark.apache.org
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>
>
> --
> Gautham Anil
>
> "The first principle is that you must not fool yourself. And you are
> the easiest person to fool" - Richard P. Feynman
>


Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-18 Thread Nicholas Chammas
Nathan,

I posted a bunch of questions for you as a comment on your question
 on Stack Overflow. If you
answer them (don't forget to @ping me) I may be able to help you.

Nick
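
For reference, gen tang's explanation quoted below boils down to: an instance
counts as "ssh-ready" once it is in the running state and its EC2 status
checks report ok. You can inspect that state yourself with a few lines of
boto3 (the instance ID and region are placeholders; spark-ec2 has its own
polling logic, so this is only an illustration of what it is waiting for):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    resp = ec2.describe_instance_status(
        InstanceIds=["i-0123456789abcdef0"],   # placeholder instance ID
        IncludeAllInstances=True,              # also report 'pending' instances
    )
    for status in resp["InstanceStatuses"]:
        ready = (status["InstanceState"]["Name"] == "running"
                 and status["InstanceStatus"]["Status"] == "ok"
                 and status["SystemStatus"]["Status"] == "ok")
        print(status["InstanceId"], "ready" if ready else "not ready yet")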

On Sat Jan 17 2015 at 3:49:54 PM gen tang  wrote:

> Hi,
>
> This is because "ssh-ready" in the ec2 script means that all the instances
> are in the "running" state and all the instances pass the "OK" status checks.
> In other words, the instances are ready to download and install software,
> just as EMR is ready for bootstrap actions.
> Before, the script just repeatedly printed information showing that we were
> waiting for every instance to be launched. That was quite ugly, so they
> changed what gets printed.
> However, you can use ssh to connect to an instance even if it is still in
> the "pending" status. If you wait patiently a little longer, the script will
> finish launching the cluster.
>
> Cheers
> Gen
>
>
> On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy 
> wrote:
>
>> Originally posted here:
>> http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script
>>
>> I'm trying to launch a standalone Spark cluster using its pre-packaged
>> EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:
>>
>> ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k
>>  -i .pem -r us-west-2 -s 3 launch test
>> Setting up security groups...
>> Searching for existing cluster test...
>> Spark AMI: ami-ae6e0d9e
>> Launching instances...
>> Launched 3 slaves in us-west-2c, regid = r-b___6
>> Launched master in us-west-2c, regid = r-0__0
>> Waiting for all instances in cluster to enter 'ssh-ready'
>> state..
>>
>> Yet I can SSH into these instances without compaint:
>>
>> ubuntu@machine:~$ ssh -i .pem root@master-ip
>> Last login: Day MMM DD HH:mm:ss 20YY from
>> c-AA-BBB--DDD.eee1.ff.provider.net
>>
>>__|  __|_  )
>>_|  ( /   Amazon Linux AMI
>>   ___|\___|___|
>>
>> https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
>> There are 59 security update(s) out of 257 total update(s) available
>> Run "sudo yum update" to apply all updates.
>> Amazon Linux version 2014.09 is available.
>> root@ip-internal ~]$
>>
>> I'm trying to figure out if this is a problem in AWS or with the Spark
>> scripts. I've never had this issue before until recently.
>>
>>
>> --
>> Nathan Murthy // 713.884.7110 (mobile) // @natemurthy
>>
>
>


Re: Discourse: A proposed alternative to the Spark User list

2015-01-17 Thread Nicholas Chammas
The Stack Exchange community will not support creating a whole new site
just for Spark (otherwise you’d see dedicated sites for much larger topics
like “Python”). Their tagging system works well enough to separate
questions about different topics, and the apache-spark
 tag on Stack
Overflow is already doing pretty well.

The ASF as well as this community will also not support any migration of
the mailing list to another system due to ASF rules
 and community
fragmentation.

Realistically, the only options available to us that I see are options 1
and 3 from my original email (which can be used together).

Option 3: Change the culture around the user list. Encourage people to use
Stack Overflow whenever possible, and this list only when their question
doesn’t fit SO’s strict rules.

Option 1: Work with the ASF and the Discourse teams to allow Discourse to
be deployed as an overlay on top of this existing mailing list. (e.g. Like
a new UI on top of an old database.)

The goal of both changes would be to make the user list more usable.

Nick

On Sat, Jan 17, 2015 at 8:51 AM Andrew Ash  wrote:

People can continue using the stack exchange sites as is with no additional
> work from the Spark team.  I would not support migrating our mailing lists
> yet again to another system like Discourse because I fear fragmentation of
> the community between the many sites.
>
> On Sat, Jan 17, 2015 at 6:24 AM, pzecevic  wrote:
>
>> Hi, guys!
>>
>> I'm reviving this old question from Nick Chammas with a new proposal: what
>> do you think about creating a separate Stack Exchange 'Apache Spark' site
>> (like 'philosophy' and 'English' etc.)?
>>
>> I'm not sure what would be the best way to deal with user and dev lists,
>> though - to merge them into one or create two separate sites...
>>
>> And I don't know it it's at all possible to migrate current lists to stack
>> exchange, but I believe it would be an improvement over the current
>> situation. People are used to stack exchange, it's easy to use and search,
>> topics (Spark SQL, Streaming, Graphx) could be marked with tags for easy
>> filtering, code formatting is super easy etc.
>>
>> What do you all think?
>>
>>
>>
>> Nick Chammas wrote
>> > When people have questions about Spark, there are 2 main places (as far
>> as
>> > I can tell) where they ask them:
>> >
>> >    - Stack Overflow, under the apache-spark tag;
>> >- This mailing list
>> >
>> > The mailing list is valuable as an independent place for discussion that
>> > is
>> > part of the Spark project itself. Furthermore, it allows for a broader
>> > range of discussions than would be allowed on Stack Overflow.
>> >
>> > As the Spark project has grown in popularity, I see that a few problems
>> > have emerged with this mailing list:
>> >
>> >- It’s hard to follow topics (e.g. Streaming vs. SQL) that you’re
>> >interested in, and it’s hard to know when someone has mentioned you
>> >specifically.
>> >    - It’s hard to search for existing threads and link information across
>> >    disparate threads.
>> >- It’s hard to format code and log snippets nicely, and by extension,
>> >hard to read other people’s posts with this kind of information.
>> >
>> > There are existing solutions to all these (and other) problems based
>> > around
>> > straight-up discipline or client-side tooling, which users have to conjure
>> > up for themselves.
>> >
>> > I’d like us as a community to consider using Discourse as an alternative
>> > to, or overlay on top of, this mailing list, that provides better
>> > out-of-the-box solutions to these problems.
>> >
>> > Discourse is a modern discussion platform built by some of the same people
>> > who created Stack Overflow. It has many neat features that I believe this
>> > community would benefit from.
>> >
>> > For example:
>> >
>> >- When a user starts typing up a new post, they get a panel *showing
>> >    existing conversations that look similar*, just like on Stack Overflow.
>> >- It’s easy to search for posts and link between them.
>> >- *Markdown support* is built-in to composer.
>> >- You can *specifically mention people* and they will be notified.
>> >- Posts can be categorized (e.g. Streaming, SQL, etc.).
>> >- There is a built-in option for mailing list support which forwards
>> > all
>> >activity on the forum to a user’s email address and which allows for
>> >creation of new posts via email.
>> >
>> > What do you think of Discourse as an alternative, more manageable way to
>> > discuss Spark?
>> >
>> > There are a few options we can consider:
>> >
>> >1. Work with the ASF as well as the Discourse team to allow Disco

Re: dockerized spark executor on mesos?

2015-01-15 Thread Nicholas Chammas
The AMPLab maintains a bunch of Docker files for Spark here:
https://github.com/amplab/docker-scripts

Hasn't been updated since 1.0.0, but might be a good starting point.

On Wed Jan 14 2015 at 12:14:13 PM Josh J  wrote:

> We have dockerized Spark Master and worker(s) separately and are using it
>> in
>> our dev environment.
>
>
> Is this setup available on github or dockerhub?
>
> On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian 
> wrote:
>
>> We have dockerized Spark Master and worker(s) separately and are using it
>> in
>> our dev environment. We don't use Mesos though, running it in Standalone
>> mode, but adding Mesos should not be that difficult I think.
>>
>> Regards
>>
>> Venkat
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/dockerized-spark-executor-on-mesos-tp20276p20603.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

