Re: get -101 error code when running select query

2014-04-23 Thread Madhu
I have seen a similar error message when connecting to Hive through JDBC.
This is just a guess on my part, but check your query. The error occurs if
you have a select that includes a null literal with an alias like this:

select a, b, null as c, d from foo

In my case, rewriting the query to use an empty string or other literal
instead of null worked:

select a, b, '' as c, d from foo

I think the problem is the lack of type information when supplying a null
literal.
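
If missing type information really is the culprit, an explicit cast might also
work (just a guess on my part, I haven't verified it):

select a, b, cast(null as string) as c, d from foo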



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/get-101-error-code-when-running-select-query-tp6377p6382.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Spark 1.0.0 rc3

2014-05-01 Thread Madhu
I'm guessing EC2 support is not there yet?

I was able to build using the binary download on both Windows 7 and RHEL 6
without issues.
I tried to create an EC2 cluster, but saw this:

~/spark-ec2
Initializing spark
~ ~/spark-ec2
ERROR: Unknown Spark version
Initializing shark
~ ~/spark-ec2 ~/spark-ec2
ERROR: Unknown Shark version

The spark dir on the EC2 master has only a conf dir, so it didn't deploy
properly.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-0-0-rc3-tp6427p6456.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Madhu
I just built rc5 on Windows 7 and tried to reproduce the problem described in

https://issues.apache.org/jira/browse/SPARK-1712

It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took
4.814991993 s
res1: Double = 5.05E11

I used all defaults; no config files were changed.
Not sure if that makes a difference...



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-14 Thread Madhu
I built rc5 using sbt/sbt assembly on Linux without any problems.
There used to be an sbt.cmd for the Windows build; has that been deprecated?
If so, I can document the Windows build steps that worked for me.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6558.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Sorting partitions in Java

2014-05-20 Thread Madhu
Thanks Sean, I had seen that post you mentioned.

What you suggest looks like an in-memory sort, which is fine if each partition is
small enough to fit in memory. Is it true that rdd.sortByKey(...) requires
partitions to fit in memory? I wasn't sure if there was some magic behind
the scenes that supports arbitrarily large sorts.
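
For reference, here's a minimal sketch of the in-memory approach (in Scala for
brevity); it materializes and sorts each partition independently, so it only
works if every partition fits comfortably in memory:

```
// Sort each partition independently, entirely in memory (spark-shell sketch).
val rdd = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (4, "d")), numSlices = 2)
val sortedWithinPartitions = rdd.mapPartitions { iter =>
  iter.toArray.sortBy(_._1).iterator  // one partition at a time: materialize, sort, hand back
}
```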

None of this is a show stopper; it just might require a little more code on
the part of the developer. If there's a requirement for Spark partitions to
fit in memory, developers will have to be aware of that and plan
accordingly. One nice feature of Hadoop MR is the ability to sort very large
sets without thinking about data size.

In the case that a developer repartitions an RDD such that some partitions
don't fit in memory, sorting those partitions requires more work. For these
cases, I think there is value in having a robust partition sorting method
that deals with it efficiently and reliably.

Is there another solution for sorting arbitrarily large partitions? If not,
I don't mind developing and contributing a solution.




-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-partitions-in-Java-tp6715p6719.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Sorting partitions in Java

2014-05-20 Thread Madhu
Sean,

No, I don't want to sort the whole RDD, sortByKey seems to be good enough
for that.

Right now, I think the code I have will work for me, but I can imagine
conditions where it will run out of memory.

I'm not completely sure if SPARK-983
(https://issues.apache.org/jira/browse/SPARK-983) that Andrew mentioned covers
the rdd.sortPartitions() use case. Can someone comment on the scope of
SPARK-983?

Thanks!



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-partitions-in-Java-tp6715p6725.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Eclipse Scala IDE/Scala test and Wiki

2014-06-02 Thread Madhu
I was able to set up Spark in Eclipse using the Scala IDE plugin.
I also got unit tests running with ScalaTest, which makes development quick
and easy.

I wanted to document the setup steps in this wiki page:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup

I can't seem to edit that page.
Confluence usually has an Edit button in the upper right, but it does not
not appear for me, even though I am logged in.

Am I missing something?



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Eclipse-Scala-IDE-Scala-test-and-Wiki-tp6908.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Buidling spark in Eclipse Kepler

2014-08-07 Thread Madhu
Ron,

I was able to build core in Eclipse following these steps:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-Eclipse

I was working only on core, so I know that works in Eclipse Juno.
I haven't tried yarn or other Eclipse releases.
Are you able to build *core* in Eclipse Kepler?

In my view, tool independence is a good thing.
I'll do what I can to support Eclipse.



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Buidling-spark-in-Eclipse-Kepler-tp7712p7730.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Unit test best practice for Spark-derived projects

2014-08-07 Thread Madhu
How long does it take to get a SparkContext?
I found that if you don't have a network connection (most likely due to a
reverse DNS lookup), it can take up to 30 seconds to start up locally. I think
a hosts file entry is sufficient.
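
For example, something like this in /etc/hosts usually avoids the reverse DNS
delay (the hostname here is just a placeholder for your machine's actual
hostname):

127.0.0.1   localhost   mydevbox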



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Unit-test-best-practice-for-Spark-derived-projects-tp7704p7731.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Handling stale PRs

2014-08-26 Thread Madhu
Sean Owen wrote
> Stale JIRAs are a symptom, not a problem per se. I also want to see
> the backlog cleared, but automatically closing doesn't help, if the
> problem is too many JIRAs and not enough committer-hours to look at
> them. Some noise gets closed, but some easy or important fixes may
> disappear as well.

Agreed. All of the problems mentioned in this thread are symptoms. There's
no shortage of talent and enthusiasm within the Spark community. The people
and the product are wonderful; the process, not so much. Spark has been
wildly successful, so some growing pains are to be expected.

Given 100+ contributors, Spark is a big project. As with big data, big
projects can run into scaling issues. There's no magic to running a
successful big project, but it does require greater planning and discipline.
JIRA is great for issue tracking, but it's not a replacement for a project
plan. Quarterly releases are a great idea: everyone knows the schedule. What
we need is a concise plan for each release with a clear scope statement.
Without knowing what is in scope and out of scope for a release, we end up
with a laundry list of things to do, but no clear goal. Laundry lists don't
scale well.

I don't mind helping with planning and documenting releases. This is
especially helpful for new contributors who don't know where to start. I
have done that successfully on many projects using JIRA and Confluence, so I
know it can be done. To address the immediate concerns of open PRs and
excessive, overlapping JIRA issues, we probably have to create a meta issue
and assign resources to fix it. I don't mind helping with that as well.



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-stale-PRs-tp8015p8031.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Handling stale PRs

2014-08-26 Thread Madhu
Nicholas Chammas wrote
> Dunno how many committers Discourse has, but it looks like they've managed
> their PRs well. I hope we can do as well in this regard as they have.

Discourse developers appear to eat their own dog food
(https://meta.discourse.org).
Improved collaboration and a shared vision might be a reason for their
success.




-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-stale-PRs-tp8015p8061.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Jira tickets for starter tasks

2014-08-29 Thread Madhu
Cheng Lian-2 wrote
> You can just start the work :)

Given 100+ contributors, starting work without a JIRA issue assigned to you
could lead to duplication of effort by well-meaning people who have no idea
they are working on the same issue. This does happen, and I don't think it's
a good thing.

Just my $0.02



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Jira-tickets-for-starter-tasks-tp8102p8127.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Madhu
Thanks Patrick.

I've been testing some 1.2 features; they look good so far.
I have some example code that I think will be helpful for certain MR-style
use cases (secondary sort).
Can I still add that to the 1.2 documentation, or is that frozen at this
point?
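
For context, here's a rough sketch of the pattern I have in mind (just an
illustration of secondary sort with the new repartitionAndSortWithinPartitions
API, not the exact code I'd contribute):

```
// Secondary sort: partition by the grouping field only, but sort within each
// partition by the full (group, value) composite key. Names are illustrative.
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._  // pair/ordered RDD implicits in 1.2

case class CompositeKey(group: String, value: Int)

object CompositeKey {
  // Ordering used by repartitionAndSortWithinPartitions
  implicit val ordering: Ordering[CompositeKey] = Ordering.by(k => (k.group, k.value))
}

class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case CompositeKey(group, _) =>
      ((group.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

val data = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
val secondarySorted = data
  .map { case (group, value) => (CompositeKey(group, value), value) }
  .repartitionAndSortWithinPartitions(new GroupPartitioner(2))
```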



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-2-0-Release-Preview-Posted-tp9400p9449.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Madhu
+1 (non-binding)

Built and tested on Windows 7:

cd apache-spark
git fetch
git checkout v1.2.0-rc2
sbt assembly
[warn]
...
[warn]
[success] Total time: 720 s, completed Dec 11, 2014 8:57:36 AM

dir assembly\target\scala-2.10\spark-assembly-1.2.0-hadoop1.0.4.jar
110,361,054 spark-assembly-1.2.0-hadoop1.0.4.jar

Ran some of my 1.2 code successfully.
Reviewed some docs, looks good.
spark-shell.cmd works as expected.

Env details:
sbtconfig.txt:
-Xmx1024M
-XX:MaxPermSize=256m
-XX:ReservedCodeCacheSize=128m

sbt --version
sbt launcher version 0.13.1




-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-0-RC2-tp9713p9728.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




RDD data flow

2014-12-16 Thread Madhu
I was looking at some of the Partition implementations in core/rdd and
getOrCompute(...) in CacheManager.
It appears that getOrCompute(...) returns an InterruptibleIterator, which
delegates to a wrapped Iterator.
That would imply that Partitions should extend Iterator, but that is not
always the case.
For example, the Partitions for these RDDs do not extend Iterator:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PartitionwiseSampledRDD.scala
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala

Why is that? Shouldn't all Partitions be Iterators? Clearly I'm missing
something.

On a related subject, I was thinking of documenting the data flow of RDDs in
more detail. The code is not hard to follow, but it's nice to have a simple
picture with the major components and some explanation of the flow.  The
declaration of Partition is throwing me off.

Thanks!



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-tp9804.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: RDD data flow

2014-12-17 Thread Madhu
Patrick Wendell wrote
> The Partition itself doesn't need to be an iterator - the iterator
> comes from the result of compute(partition). The Partition is just an
> identifier for that partition, not the data itself.

OK, that makes sense. The docs for Partition are a bit vague on this point.
Maybe I'll add this to the docs.
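
Something like this is what I have in mind for the docs: a simplified sketch of
the contract (signatures paraphrased from the current code, type tags and
dependencies omitted):

```
// A Partition is only an identifier; the data comes from compute().
trait Partition extends Serializable {
  def index: Int  // position of this partition within its parent RDD
}

abstract class RDD[T] {
  protected def getPartitions: Array[Partition]                      // identifiers only
  def compute(split: Partition, context: TaskContext): Iterator[T]   // produces the data
}
```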

Thanks Patrick!



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-tp9804p9820.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Detecting configuration problems

2015-09-08 Thread Madhu
Thanks Akhil!

I suspect the root cause of the shuffle OOM I was seeing (and probably many
that users might see) is individual partitions on the reduce side not fitting
in memory. As a guideline, I was thinking of something like "be sure that your
largest partitions occupy no more than 1% of executor memory", or something to
that effect. I can add that documentation to the tuning page if someone can
suggest the best wording and numbers. I can also add a simple Spark shell
example to estimate the largest partition size to determine executor memory
and the number of partitions.
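
Roughly what I have in mind is something like this (a sketch; it assumes
SizeEstimator is accessible in your build, and the per-record estimates ignore
shared overhead, so treat the result as a ballpark figure only):

```
// Rough estimate of the largest partition of an RDD, in bytes (spark-shell style).
import org.apache.spark.util.SizeEstimator

val rdd = sc.parallelize(1 to 1000000).map(i => (i, "value-" + i))
val bytesPerPartition = rdd.mapPartitions { iter =>
  Iterator(iter.map(rec => SizeEstimator.estimate(rec)).sum)  // bytes in this partition
}.collect()
val largestPartitionBytes = bytesPerPartition.max
// Guideline above: executor memory should be roughly 100x the largest partition.
```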

One more question: I'm trying to get my head around the shuffle code. I see
ShuffleManager, but that seems to be on the reduce side. Where is the code
driving the map-side writes and reduce-side reads? I think it is possible to
add up the reduce-side volume for a key (they are byte reads at some point)
and raise an alarm if it's getting too high. Even a warning on the console
would be better than a catastrophic OOM.



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Detecting-configuration-problems-tp13980p13998.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: spark-shell 1.5 doesn't seem to work in local mode

2015-09-19 Thread Madhu
Thanks guys.

I do have HADOOP_INSTALL set, but Spark 1.4.1 did not seem to mind.
Seems like there's a difference in behavior between 1.5.0 and 1.4.1 for some
reason.

To the best of my knowledge, I just downloaded each tgz and untarred them in
/opt.
I adjusted my PATH to point to one or the other, but that should be about
it.

Does 1.5.0 pick up HADOOP_INSTALL?
Wouldn't spark-shell --master local override that?
1.5 seemed to completely ignore --master local.



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-shell-1-5-doesn-t-seem-to-work-in-local-mode-tp14212p14217.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




spark-shell 1.5 doesn't seem to work in local mode

2015-09-19 Thread Madhu
059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.ConnectException: Call From ltree1/127.0.0.1 to
localhost:9000 failed on connection exception: java.net.ConnectException:
Connection refused; For more details see: 
http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy21.getFileInfo(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1988)
at
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
at
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
at
org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:596)
at
org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 56 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)
... 76 more

<console>:10: error: not found: value sqlContext
   import sqlContext.implicits._
  ^
<console>:10: error: not found: value sqlContext
   import sqlContext.sql
  ^




-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-shell-1-5-doesn-t-seem-to-work-in-local-mode-tp14212.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Help needed to publish SizeEstimator as separate library

2014-11-19 Thread madhu phatak
Hi,
 As I was going through the Spark source code, SizeEstimator
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala)
caught my eye. It's a very useful tool for doing size estimation on the JVM,
which helps in use cases like a memory-bounded cache.

It would be useful to have this as a separate library that can be used in
other projects too. There was a discussion about this a while back
(https://spark-project.atlassian.net/browse/SPARK-383), but I don't see any
updates on it.

I have extracted the code and packaged it as a separate project on GitHub
(https://github.com/phatak-dev/java-sizeof). I have simplified the code to
remove the dependencies on Google Guava and OpenHashSet, which leads to a
small compromise in accuracy for big arrays. At the same time, it greatly
simplifies the code base and the dependency graph. I want to publish it to
Maven Central so it can be added as a dependency.

Though I have published the code under my own package (com.madhu) and kept the
license information, I am not sure if that is the right way to do it. It would
be great if someone could guide me on package naming and attribution.

-- 
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com


Re: Contributing Documentation Changes

2015-04-24 Thread madhu phatak
Hi,
I understand that. The following page,

http://spark.apache.org/documentation.html, has an "external tutorials, blogs"
section which points to other blog pages. I wanted to add mine there.




Regards,
Madhukara Phatak
http://datamantra.io/

On Fri, Apr 24, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:

> I think that your own tutorials and such should live on your blog. The
> goal isn't to pull in a bunch of external docs to the site.
>
> On Fri, Apr 24, 2015 at 12:57 AM, madhu phatak <phatak@gmail.com>
> wrote:
>> Hi,
>>  As I was reading the Contributing to Spark wiki, it was mentioned that
>> we can contribute external links to Spark tutorials. I have written many
>> of them on my blog (http://blog.madhukaraphatak.com/categories/spark/).
>> It would be great if someone could add them to the Spark website.
>>
>>
>>
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/



Contributing Documentation Changes

2015-04-23 Thread madhu phatak
Hi,
 As I was reading the Contributing to Spark wiki, it was mentioned that we can
contribute external links to Spark tutorials. I have written many of them on
my blog (http://blog.madhukaraphatak.com/categories/spark/). It would be great
if someone could add them to the Spark website.



Regards,
Madhukara Phatak
http://datamantra.io/


Review of ML PR

2017-08-14 Thread madhu phatak
Hi,

I submitted a PR around 2 months back to improve the performance of decision
trees by allowing a flexible, user-provided storage level for intermediate
data. I have posted a few questions about handling backward compatibility, but
there have been no answers for a long time.

Can anybody help me move this forward? Below is the link to the PR:

https://github.com/apache/spark/pull/17972

-- 
Regards,
Madhukara Phatak
http://datamantra.io/


RandomForest caching

2017-04-28 Thread madhu phatak
Hi,

I am testing RandomForestClassification with 50 GB of data which is cached
in memory. I have 64 GB of RAM, of which 28 GB is used for caching the
original dataset.

When I run random forest, it caches around 300 GB of intermediate data, which
evicts the original dataset from the cache. This caching is triggered by the
code below in RandomForest.scala:

```
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
numTrees, withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)

```

As I don't have control over the storage level, I cannot ensure the original
dataset stays in memory for other interactive tasks while the random forest is
running.

Is it a good idea to make this storage level a user parameter? If so, I can
open a JIRA issue and submit a PR for it.
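
To make the idea concrete, this is the kind of usage I'm imagining (the setter
below is purely hypothetical, it does not exist today):

```
// Hypothetical usage if the intermediate storage level became a user parameter.
// setIntermediateStorageLevel is NOT an existing API; it is only an illustration.
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setIntermediateStorageLevel("DISK_ONLY")  // hypothetical parameter
```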

-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Re: RandomForest caching

2017-05-12 Thread madhu phatak
Hi,
I opened a JIRA issue:

https://issues.apache.org/jira/browse/SPARK-20723

Can someone have a look?

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak <phatak@gmail.com> wrote:

> Hi,
>
> I am testing RandomForestClassification with 50 GB of data which is cached
> in memory. I have 64 GB of RAM, of which 28 GB is used for caching the
> original dataset.
>
> When I run random forest, it caches around 300 GB of intermediate data,
> which evicts the original dataset from the cache. This caching is triggered
> by the code below in RandomForest.scala:
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
> numTrees, withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>
> ```
>
> As I don't have control over the storage level, I cannot ensure the original
> dataset stays in memory for other interactive tasks while the random forest
> is running.
>
> Is it a good idea to make this storage level a user parameter? If so, I can
> open a JIRA issue and submit a PR for it.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>



-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi,
As I am playing with structured streaming, I observed that the window function
always requires a time column in the input data, so that means it's event time.

Is it possible to do an old Spark Streaming style window based on processing
time? I don't see any documentation on this.

-- 
Regards,
Madhukara Phatak
http://datamantra.io/


Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
Hi,
That's great. Thanks a lot.
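
One small note for the archive: functions.window() takes a Column as its first
argument, so the full expression would look something like this (my paraphrase
of the suggestion quoted below, assuming ds is the streaming Dataset):

```
// Processing-time window via a current_timestamp() column (paraphrased).
import org.apache.spark.sql.functions.{col, current_timestamp, window}

val counts = ds
  .withColumn("processingTime", current_timestamp())
  .groupBy(window(col("processingTime"), "1 minute"))
  .count()
```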

On Wed, Aug 30, 2017 at 10:44 AM, Tathagata Das <tathagata.das1...@gmail.com
> wrote:

> Yes, it can be! There is a sql function called current_timestamp() which
> is self-explanatory. So I believe you should be able to do something like
>
> import org.apache.spark.sql.functions._
>
> ds.withColumn("processingTime", current_timestamp())
>   .groupBy(window("processingTime", "1 minute"))
>   .count()
>
>
> On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak <phatak@gmail.com>
> wrote:
>
>> Hi,
>> As I am playing with structured streaming, I observed that the window
>> function always requires a time column in the input data, so that means
>> it's event time.
>>
>> Is it possible to do an old Spark Streaming style window based on
>> processing time? I don't see any documentation on this.
>>
>> --
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
>>
>
>


-- 
Regards,
Madhukara Phatak
http://datamantra.io/