[no subject]

2016-10-06 Thread ayan guha
Hi

Faced one issue:

- Writing a Hive-partitioned table using:

df.withColumn("partition_date",to_date(df["INTERVAL_DATE"])).write.partitionBy('partition_date').saveAsTable("sometable",mode="overwrite")

- Data got written to HDFS fine. I can see the folders with partition names
such as

/app/somedb/hive/somedb.db/sometable/partition_date=2016-09-28
/app/somedb/hive/somedb.db/sometable/partition_date=2016-09-29

and so on.
- Also, _common_metadata & _metadata files are written properly

- I can read the data from Spark fine using
read.parquet("/app/somedb/hive/somedb.db/sometable"); printSchema shows
all columns.

- However, I cannot read the table from Hive.

Problem 1: Hive does not think the table is partitioned.
Problem 2: Hive sees only one column (array from deserializer).
Problem 3: MSCK REPAIR TABLE failed, saying the partitions are not in the metadata.

Question: Is this a known issue when writing Hive partitioned tables from Spark?
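
For illustration, a hedged sketch of one commonly suggested workaround (not
verified against this exact setup): create the partitioned table in Hive first,
then write through the Hive-aware insertInto() path so the partitions are
registered in the metastore. The names below are the ones from this thread;
sqlContext stands for whatever HiveContext/SQLContext the job already uses, and
the dynamic-partition settings may or may not be needed depending on the Hive
configuration.

from pyspark.sql.functions import to_date

# assumes somedb.sometable already exists in Hive, partitioned by partition_date,
# with columns matching df plus the trailing partition column; insertInto()
# matches columns by position, with the partition column last
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df2 = df.withColumn("partition_date", to_date(df["INTERVAL_DATE"]))
df2.write.insertInto("somedb.sometable", overwrite=True)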


-- 
Best Regards,
Ayan Guha


Re: spark standalone with multiple workers gives a warning

2016-10-06 Thread Ofer Eliassaf
The slaves should connect to the master using the scripts in sbin...
You can read about it here:
http://spark.apache.org/docs/latest/spark-standalone.html
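
A minimal sketch of that setup, going by the standalone docs linked above
(host names and resource sizes are placeholders; SPARK_WORKER_INSTANCES in
conf/spark-env.sh is what the sbin launch scripts read to start several worker
JVMs per node, while the deprecation warning quoted below is about the
executor-count settings used at submit time):

# conf/spark-env.sh on each worker node
SPARK_WORKER_INSTANCES=3   # worker JVMs per node
SPARK_WORKER_CORES=4       # cores given to each worker JVM
SPARK_WORKER_MEMORY=8g     # memory given to each worker JVM

# on the master node
./sbin/start-master.sh
# on each worker node, pointing the workers at the master
./sbin/start-slave.sh spark://<master-host>:7077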

On Thu, Oct 6, 2016 at 6:46 PM, Mendelson, Assaf wrote:

> Hi,
>
> I have a spark standalone cluster. On it, I am using 3 workers per node.
>
> So I added SPARK_WORKER_INSTANCES set to 3 in spark-env.sh
>
> The problem is, that when I run spark-shell I get the following warning:
>
> WARN SparkConf:
>
> SPARK_WORKER_INSTANCES was detected (set to '3').
>
> This is deprecated in Spark 1.0+.
>
>
>
> Please instead use:
>
> - ./spark-submit with --num-executors to specify the number of executors
>
> - Or set SPARK_EXECUTOR_INSTANCES
>
> - spark.executor.instances to configure the number of instances in the
> spark config.
>
>
>
> So how would I start a cluster with 3 workers per node? SPARK_WORKER_INSTANCES
> is the only way I see to do that for the standalone cluster, and the only place
> I see to define it is spark-env.sh. The spark-submit option,
> SPARK_EXECUTOR_INSTANCES, and spark.executor.instances all relate to submitting
> the job, not to starting the workers.
>
>
>
> Any ideas?
>
> Thanks
>
> Assaf
>



-- 
Regards,
Ofer Eliassaf


spark stateful streaming error

2016-10-06 Thread backtrack5
I am using a PySpark stateful stream (2.0) which receives JSON from a socket.
I get the following error when I send more than one record: if I send only one
message I get a response, but as soon as I send more than one message I get the
error below.

import hashlib
import json

def createmd5Hash(po):
    data = json.loads(po)
    return (hashlib.md5(data['somevalue'].encode('utf-8')).hexdigest(), data)
Implementation 1:

stream = ssc.socketTextStream("ip", 3341)
ssc.checkpoint('E:\\Work\\Python1\\work\\spark\\checkpoint\\')
initialStateRDD = sc.parallelize([(u'na', 1)])
with_hash = stream.map(lambda po: createmd5Hash(po)).reduceByKey(lambda s1, s2: s1)
Implementation 2:

stream = ssc.textFileStream('C:\\sparkpoc\\input')
ssc.checkpoint('E:\\Work\\Python1\\work\\spark\\checkpoint\\')
initialStateRDD = sc.parallelize([(u'na', 1)])
with_hash = stream.map(lambda po: createmd5Hash(po)).reduceByKey(lambda s1, s2: s1)


To be specific: I get the expected result when I read the JSON from the file
system via textFileStream, but I get the following error when I use the socket
stream (socketTextStream):

16/10/06 20:50:42 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
"E:\Work\spark\installtion\spark\python\lib\pyspark.zip\pyspark\worker.py",
line 172, in main
  File
"E:\Work\spark\installtion\spark\python\lib\pyspark.zip\pyspark\worker.py",
line 167, in process
  File "E:\Work\spark\installtion\spark\python\pyspark\rdd.py", line 2371,
in pipeline_func
return func(split, prev_func(split, iterator))
  File "E:\Work\spark\installtion\spark\python\pyspark\rdd.py", line 2371,
in pipeline_func
return func(split, prev_func(split, iterator))
  File "E:\Work\spark\installtion\spark\python\pyspark\rdd.py", line 317, in
func
return f(iterator)
  File "E:\Work\spark\installtion\spark\python\pyspark\rdd.py", line 1792,
in combineLocally
merger.mergeValues(iterator)
  File
"E:\Work\spark\installtion\spark\python\lib\pyspark.zip\pyspark\shuffle.py",
line 236, in mergeValues
for k, v in iterator:
  File "E:/Work/Python1/work/spark/streamexample.py", line 159, in 
with_hash = stream.map(lambda po : createmd5Hash(po)).reduceByKey(lambda
s1,s2:s1)
  File "E:/Work/Python1/work/spark/streamexample.py", line 31, in
createmd5Hash
data = json.loads(input_line)
  File "C:\Python34\lib\json\__init__.py", line 318, in loads
return _default_decoder.decode(s)
  File "C:\Python34\lib\json\decoder.py", line 343, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python34\lib\json\decoder.py", line 361, in raw_decode
raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 1 column 1 (char 0)

at
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Can someone please help me?
http://stackoverflow.com/questions/39897475/spark-stateful-streaming-error
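
For illustration only, a hedged sketch of a more defensive parse. It assumes the
real cause is that the socket sender sometimes delivers lines that are empty or
are not exactly one complete JSON object (which is what the ValueError above
complains about), and simply skips such lines instead of failing the stage:

import hashlib
import json

def parse_line(line):
    # socketTextStream splits the incoming bytes on newlines, so every record
    # must arrive as exactly one JSON object terminated by '\n'
    try:
        data = json.loads(line)
        key = hashlib.md5(data['somevalue'].encode('utf-8')).hexdigest()
        return [(key, data)]
    except ValueError:
        return []   # skip blank, partial, or concatenated lines

with_hash = stream.flatMap(parse_line).reduceByKey(lambda s1, s2: s1)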





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-stateful-streaming-error-tp27851.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

2016-10-06 Thread Koert Kuipers
if the intention is to create this on the default hadoop filesystem (and
not local), then maybe we can use FileSystem.getHomeDirectory()? it should
return the correct home directory on the relevant FileSystem (local or
hdfs).

if the intention is to create this only locally, then why bother using
hadoop filesystem api at all?
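
For reference, a minimal sketch of the workaround discussed below: pin
spark.sql.warehouse.dir to a fully qualified URI so there is no ambiguity about
which filesystem it resolves against (the path and app name here are
placeholders, not a recommendation):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("warehouse-path-workaround")
    # explicit scheme: file:///... for a local dir, hdfs:///... for HDFS
    .config("spark.sql.warehouse.dir", "hdfs:///tmp/spark-warehouse")
    .getOrCreate())

The equivalent spark-defaults.conf entry would be a single line:
spark.sql.warehouse.dir  hdfs:///tmp/spark-warehouse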

On Thu, Oct 6, 2016 at 9:45 AM, Koert Kuipers  wrote:

> well it seems to work if i set spark.sql.warehouse.dir to
> /tmp/spark-warehouse in spark-defaults, and it creates it on hdfs.
>
> however can this directory safely be shared between multiple users running
> jobs?
>
> if not then i need to set this per user (instead of single setting in
> spark-defaults) which means i need to change the jobs, which means an
> upgrade for a production cluster running many jobs becomes more difficult.
>
> or can i create a setting in spark-defaults that includes a reference to
> the user? something like /tmp/{user}/spark-warehouse?
>
>
>
> On Thu, Oct 6, 2016 at 6:04 AM, Sean Owen  wrote:
>
>> Yeah I see the same thing. You can fix this by setting
>> spark.sql.warehouse.dir of course as a workaround. I restarted a
>> conversation about it at https://github.com/apache/spark/pull/13868#pullrequestreview-3081020
>>
>> I think the question is whether spark-warehouse is always supposed to be
>> a local dir, or could be an HDFS dir? a change is needed either way, just
>> want to clarify what it is.
>>
>>
>> On Thu, Oct 6, 2016 at 5:18 AM Koert Kuipers  wrote:
>>
>>> i just replaced our spark 2.0.0 install on the yarn cluster with spark 2.0.1
>>> and copied over the configs.
>>>
>>> to give it a quick test i started spark-shell and created a dataset. i
>>> get this:
>>>
>>> 16/10/05 23:55:13 WARN spark.SparkContext: Use an existing SparkContext,
>>> some configuration may not take effect.
>>> Spark context Web UI available at http://***:4040
>>> Spark context available as 'sc' (master = yarn, app id =
>>> application_1471212701720_1580).
>>> Spark session available as 'spark'.
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>>>       /_/
>>>
>>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.7.0_75)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>>
>>> scala> import spark.implicits._
>>> import spark.implicits._
>>>
>>> scala> val x = List(1,2,3).toDS
>>> org.apache.spark.SparkException: Unable to create database default as
>>> failed to create its directory hdfs://dev/home/koert/spark-warehouse
>>>   at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:114)
>>>   at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:108)
>>>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
>>>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
>>>   at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
>>>   at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
>>>   at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
>>>   at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
>>>   at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
>>>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>>>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
>>>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>>>   at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
>>>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:423)
>>>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:380)
>>>   at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:171)
>>>   ... 50 elided
>>>
>>> this did not happen in spark 2.0.0
>>> the location it is trying to access makes little sense, since it is
>>> going to hdfs but then it is looking for my local home directory
>>> (/home/koert exists locally but not on hdfs).
>>>
>>> i suspect the issue is SPARK-15899, but i am not sure. in the pullreq
>>> for that WAREHOUSE_PATH got changed:
>>>    val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir")
>>>      .doc("The default location for managed databases and tables.")
>>>      .stringConf
>>> -    .createWithDefault("file:${system:user.dir}/spark-warehouse")
>>> +    .createWithDefault("${system:user.dir}/spark-warehouse")
>>>
>>> notice how the file: got removed from the url, causing spark to resolve
>>> the user.dir path against the default filesystem (hdfs) instead of the
>>> local filesystem.

Spark REST API YARN client mode is not full?

2016-10-06 Thread Vladimir Tretyakov
Hi,

When I start Spark v1.6 (cdh5.8.0) in YARN client mode I see that port 4040
is available, but the UI shows nothing and the API returns incomplete information.

I started Spark application like this:

spark-submit --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar 1

API returns me:

http://localhost:4040/api/v1/applications

[ {
  "name" : "Spark Pi",
  "attempts" : [ {
"startTime" : "2016-10-05T11:27:54.558GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"sparkUser" : "",
"completed" : false
  } ]
} ]

Where is the application id? How can I get more detailed information about the
application without this id (I mean the /applications/[app-id]/jobs,
/applications/[app-id]/stages, etc. endpoints from
http://spark.apache.org/docs/1.6.0/monitoring.html)?

The UI also shows me empty pages.

Without appId we cannot use other REST API calls. Is there any other way to
get RUNNING application ids?

Please help me understand what's going on.
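
For illustration, a hedged sketch of one way to discover RUNNING application ids
in YARN client mode: ask the YARN ResourceManager REST API instead of the
driver's 4040 endpoint. The ResourceManager host and port below are placeholders.

import json
from urllib.request import urlopen

RM = "http://resourcemanager-host:8088"   # placeholder ResourceManager address

with urlopen(RM + "/ws/v1/cluster/apps?states=RUNNING") as resp:
    apps = json.loads(resp.read().decode("utf-8"))

# response shape: {"apps": {"app": [ ... ]}}; "apps" is null when nothing is running
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"])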


Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-06 Thread amarouni
You can get some more insight by using the Spark history server
(http://spark.apache.org/docs/latest/monitoring.html); it can show you
which task is failing and some other information that might help you
debug the issue.


On 05/10/2016 19:00, Babak Alipour wrote:
> The issue seems to lie in the RangePartitioner trying to create equal
> ranges. [1]
>
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>
>  The Double values I'm trying to sort are mostly in the range [0,1]
> (~70% of the data, roughly 1 billion records); the other values in the
> dataset go as high as 2000. With the RangePartitioner trying to create
> equal ranges, some tasks become almost empty while others are extremely
> large, due to the heavily skewed distribution.
>
> This is either a bug in Apache Spark or a major limitation of the
> framework. Has anyone else encountered this?
>
> Babak Alipour,
> University of Florida
>
> On Sun, Oct 2, 2016 at 1:38 PM, Babak Alipour wrote:
>
> Thanks Vadim for sharing your experience, but I have tried
> multi-JVM setup (2 workers), various sizes for
> spark.executor.memory (8g, 16g, 20g, 32g, 64g) and
> spark.executor.core (2-4), same error all along.
>
> As for the files, these are all .snappy.parquet files, resulting
> from inserting some data from other tables. None of them actually
> exceeds 25MiB (I don't know why this number). Setting the DataFrame
> to persist using StorageLevel.MEMORY_ONLY shows size in memory at
> ~10g.  I still cannot understand why it is trying to create such a
> big page when sorting. The entire column (this df has only 1
> column) is not that big, neither are the original files. Any ideas?
>
>
> >Babak
>
>
>
> Babak Alipour,
> University of Florida
>
> On Sun, Oct 2, 2016 at 1:45 AM, Vadim Semenov wrote:
>
> oh, and try to run even smaller executors, i.e. with
> `spark.executor.memory` <= 16GiB. I wonder what result you're
> going to get.
>
> On Sun, Oct 2, 2016 at 1:24 AM, Vadim Semenov wrote:
>
> > Do you mean running a multi-JVM 'cluster' on the single
> machine? 
> Yes, that's what I suggested.
>
> You can get some information here: 
> 
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> 
> 
>
> > How would that affect performance/memory-consumption? If
> a multi-JVM setup can handle such a large input, then why
> can't a single-JVM break down the job into smaller tasks?
> I don't have an answer to these questions, it requires
> understanding of Spark, JVM, and your setup internal.
>
> I ran into the same issue only once when I tried to read a
> gzipped file which size was >16GiB. That's the only time I
> had to meet
> this 
> https://github.com/apache/spark/blob/5d84c7fd83502aeb551d46a740502db4862508fe/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L238-L243
> 
> 
> In the end I had to recompress my file into bzip2 that is
> splittable to be able to read it with spark.
>
>
> I'd look into size of your files and if they're huge I'd
> try to connect the error you got to the size of the files
> (but it's strange to me as a block size of a Parquet file
> is 128MiB). I don't have any other suggestions, I'm sorry.
>
>
> On Sat, Oct 1, 2016 at 11:35 PM, Babak Alipour wrote:
>
> Do you mean running a multi-JVM 'cluster' on the
> single machine? How would that affect
> performance/memory-consumption? If a multi-JVM setup
> can handle such a large input, then why can't a
> single-JVM break down the job into smaller tasks?
>
> I also found that SPARK-9411 mentions making the
> page_size configurable but it's hard-limited
> to ((1L<<31) -1) *8L [1]
>
> [1]
> 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
> 
> 
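
For illustration only, a hedged sketch of how one might confirm the skew
described earlier in this thread before deciding on a fix. The column name
"value" and the bucket width are placeholders, and raising
spark.sql.shuffle.partitions is a guess at a mitigation, not a guaranteed
workaround for the page-size limit.

from pyspark.sql import functions as F

# rough histogram of the sort column: count rows per fixed-width bucket to see
# how uneven the RangePartitioner's sampled ranges are likely to be
(df.select(F.floor(F.col("value") * 10).alias("bucket"))   # 0.1-wide buckets
   .groupBy("bucket")
   .count()
   .orderBy("bucket")
   .show(100, truncate=False))

# spreading the dense [0, 1] range over more shuffle partitions *may* keep each
# task's sort buffer under the allocator's hard limit
spark.conf.set("spark.sql.shuffle.partitions", "2000")
sorted_df = df.sort("value")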

Re: building Spark 2.1 vs Java 1.8 on Ubuntu 16/06

2016-10-06 Thread Marco Mistroni
Thanks Fred.
The build/mvn script will trigger compilation using zinc, and I want to avoid
that because every time I have tried it, it has run into errors while compiling
spark core. How can I disable zinc by default?
Kr

On 5 Oct 2016 10:53 pm, "Fred Reiss"  wrote:

> Actually the memory options *are* required for Java 1.8. Without them the
> build will fail intermittently. We just updated the documentation with
> regard to this fact in Spark 2.0.1. Relevant PR is here:
> https://github.com/apache/spark/pull/15005
>
> Your best bet as the project transitions from Java 7 to Java 8 is to use
> the scripts build/mvn and build/sbt, which should be updated on a regular
> basis with safe JVM options.
>
> Fred
>
> On Wed, Oct 5, 2016 at 1:40 AM, Marco Mistroni 
> wrote:
>
>> Thanks Richard. It also says that for Java 1.8 the MAVEN_OPTS are not
>> required... unless I misinterpreted the instructions.
>> Kr
>>
>> On 5 Oct 2016 9:20 am, "Richard Siebeling"  wrote:
>>
>>> sorry, now with the link included, see
>>> http://spark.apache.org/docs/latest/building-spark.html
>>>
>>> On Wed, Oct 5, 2016 at 10:19 AM, Richard Siebeling wrote:
>>>
 Hi,

 did you set the following option: export MAVEN_OPTS="-Xmx2g
 -XX:ReservedCodeCacheSize=512m"

 kind regards,
 Richard

 On Tue, Oct 4, 2016 at 10:21 PM, Marco Mistroni 
 wrote:

> Hi all
>  my mvn build of Spark 2.1 using Java 1.8 is running out of memory,
> with an error saying it cannot allocate enough memory during maven
> compilation.
>
> The instructions (on the Spark 2.0 page) say that MAVEN_OPTS is not
> needed for Java 1.8 and, according to my understanding, the spark build
> process will add it during the build via mvn.
> Note: i am not using Zinc. Rather, i am using my own Maven version
> (3.3.9), launching this command from the main spark directory. The same
> build works when i use Java 1.7 (and MAVEN_OPTS):
>
> mvn -Pyarn -Dscala-2.11 -DskipTests clean package
>
> Could anyone assist?
> kr
>   marco
>


>>>
>
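
Pulling the pieces of this thread together as one hedged sketch (the memory
values are the ones Richard quoted, the mvn invocation is Marco's own; adjust
both as needed):

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Dscala-2.11 -DskipTests clean package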


Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced by my PR,
https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347

+1 for creating a JIRA and a PR. If doing that is a problem for you, I would be
happy to do it quickly.
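
In the meantime, a hedged workaround sketch: it simply sidesteps the broken
list-handling branch by reading each path separately and unioning the results.
The paths are placeholders, and sqlContext stands for whatever SQLContext the
job already has.

from functools import reduce

paths = ["/data/part-0.txt", "/data/part-1.txt"]   # placeholder paths

dfs = [sqlContext.read.text(p) for p in paths]
df = reduce(lambda a, b: a.unionAll(b), dfs)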


On 5 Oct 2016 9:12 p.m., "Laurent Legrand"  wrote:

> Hello,
>
> When I try to load multiple text files with the sqlContext, I get the
> following error:
>
> spark-2.0.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 282, in text
> UnboundLocalError: local variable 'path' referenced before assignment
>
> According to the code
> (https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L291),
> the variable 'path' is not set if the argument is not a string.
>
> Could you confirm it is a bug?
>
> Regards,
> Laurent
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sqlContext-read-text-does-not-work-with-a-list-of-paths-tp27838.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>