Re: get -101 error code when running select query
I have seen a similar error message when connecting to Hive through JDBC. This is just a guess on my part, but check your query. The error occurs if you have a select that includes a null literal with an alias, like this:

select a, b, null as c, d from foo

In my case, rewriting the query to use an empty string or another typed literal instead of null worked:

select a, b, '' as c, d from foo

I think the problem is the lack of type information when supplying a null literal. If you really need the null, casting it to an explicit type (for example, cast(null as string) as c) might also work, though I haven't verified that.
Re: Spark 1.0.0 rc3
I'm guessing EC2 support is not there yet? I was able to build using the binary download on both Windows 7 and RHEL 6 without issues. I tried to create an EC2 cluster, but saw this:

~/spark-ec2 Initializing spark
~ ~/spark-ec2 ERROR: Unknown Spark version
Initializing shark
~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version

The spark dir on the EC2 master has only a conf dir, so it didn't deploy properly.
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I just built rc5 on Windows 7 and tried to reproduce the problem described in https://issues.apache.org/jira/browse/SPARK-1712

It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took 4.814991993 s
res1: Double = 5.05E11

I used all defaults; no config files were changed. Not sure if that makes a difference...
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I built rc5 using sbt/sbt assembly on Linux without any problems. There used to be an sbt.cmd for Windows builds; has that been deprecated? If so, I can document the Windows build steps that worked for me.
Re: Sorting partitions in Java
Thanks Sean, I had seen that post you mentioned. What you suggest looks like an in-memory sort, which is fine if each partition is small enough to fit in memory.

Is it true that rdd.sortByKey(...) requires partitions to fit in memory? I wasn't sure if there was some magic behind the scenes that supports arbitrarily large sorts. None of this is a show stopper; it just might require a little more code on the part of the developer. If there's a requirement for Spark partitions to fit in memory, developers will have to be aware of that and plan accordingly. One nice feature of Hadoop MR is the ability to sort very large sets without thinking about data size.

In the case where a developer repartitions an RDD such that some partitions don't fit in memory, sorting those partitions requires more work. For these cases, I think there is value in having a robust partition-sorting method that deals with it efficiently and reliably. Is there another solution for sorting arbitrarily large partitions? If not, I don't mind developing and contributing a solution (a sketch of the simple in-memory approach is below for reference).

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
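Here is a minimal Scala sketch of that in-memory approach: sort each partition independently with mapPartitions. The data and types are made up for illustration, and the whole partition is materialized in memory, which is exactly the limitation discussed above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("sort-partitions").setMaster("local[2]"))

// Illustrative data: two partitions of (key, value) pairs.
val rdd = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (5, "e")), 2)

// Sort each partition independently. The partition is collected into an
// array first, so this only works when a partition fits in executor memory.
val sortedWithinPartitions = rdd.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator,
  preservesPartitioning = true)

sortedWithinPartitions.glom().collect().foreach(p => println(p.mkString(", ")))
```

A version that spills to disk for oversized partitions would need considerably more machinery, which is the gap described above.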
Re: Sorting partitions in Java
Sean,

No, I don't want to sort the whole RDD; sortByKey seems to be good enough for that. Right now, I think the code I have will work for me, but I can imagine conditions where it will run out of memory.

I'm not completely sure if SPARK-983 (https://issues.apache.org/jira/browse/SPARK-983), which Andrew mentioned, covers the rdd.sortPartitions() use case. Can someone comment on the scope of SPARK-983?

Thanks!

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Eclipse Scala IDE/Scala test and Wiki
I was able to set up Spark in Eclipse using the Scala IDE plugin. I also got unit tests running with ScalaTest, which makes development quick and easy. I wanted to document the setup steps in this wiki page:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup

I can't seem to edit that page. Confluence usually has an Edit button in the upper right, but it does not appear for me, even though I am logged in. Am I missing something?

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: Building spark in Eclipse Kepler
Ron,

I was able to build core in Eclipse following these steps:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-Eclipse

I was working only on core, so I know that works in Eclipse Juno. I haven't tried yarn or other Eclipse releases. Are you able to build *core* in Eclipse Kepler?

In my view, tool independence is a good thing. I'll do what I can to support Eclipse.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: Unit test best practice for Spark-derived projects
How long does it take to get a SparkContext? I found that if you don't have a network connection (reverse DNS lookup, most likely), it can take up to 30 seconds to start up locally. I think a hosts file entry is sufficient. A sketch of the test setup I use is below.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
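Here is a minimal ScalaTest sketch of that setup: one local SparkContext per suite rather than per test, so the startup cost (including any DNS delay) is paid only once. The class and test names are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ExampleSparkSuite extends FunSuite with BeforeAndAfterAll {

  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // local[2] avoids any cluster dependency; a hosts file entry for the
    // machine name avoids the slow reverse DNS lookup on startup.
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("count values on a tiny dataset") {
    val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
    assert(counts("a") === 2)
  }
}
```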
Re: Handling stale PRs
Sean Owen wrote:
> Stale JIRAs are a symptom, not a problem per se. I also want to see the
> backlog cleared, but automatically closing doesn't help, if the problem is
> too many JIRAs and not enough committer-hours to look at them. Some noise
> gets closed, but some easy or important fixes may disappear as well.

Agreed. All of the problems mentioned in this thread are symptoms. There's no shortage of talent and enthusiasm within the Spark community. The people and the product are wonderful. The process: not so much.

Spark has been wildly successful, so some growing pains are to be expected. Given 100+ contributors, Spark is a big project. As with big data, big projects can run into scaling issues. There's no magic to running a successful big project, but it does require greater planning and discipline.

JIRA is great for issue tracking, but it's not a replacement for a project plan. Quarterly releases are a great idea; everyone knows the schedule. What we need is a concise plan for each release with a clear scope statement. Without knowing what is in scope and out of scope for a release, we end up with a laundry list of things to do, but no clear goal. Laundry lists don't scale well.

I don't mind helping with planning and documenting releases. This is especially helpful for new contributors who don't know where to start. I have done that successfully on many projects using JIRA and Confluence, so I know it can be done.

To address the immediate concerns of open PRs and excessive, overlapping JIRA issues, we probably have to create a meta issue and assign resources to fix it. I don't mind helping with that also.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: Handling stale PRs
Nicholas Chammas wrote:
> Dunno how many committers Discourse has, but it looks like they've managed
> their PRs well. I hope we can do as well in this regard as they have.

Discourse developers appear to eat their own dog food (https://meta.discourse.org). Improved collaboration and a shared vision might be a reason for their success.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: Jira tickets for starter tasks
Cheng Lian wrote:
> You can just start the work :)

Given 100+ contributors, starting work without a JIRA issue assigned to you could lead to duplication of effort by well-meaning people who have no idea they are working on the same issue. This does happen, and I don't think it's a good thing.

Just my $0.02

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted
Thanks Patrick. I've been testing some 1.2 features; it looks good so far.

I have some example code that I think will be helpful for certain MR-style use cases (secondary sort). Can I still add that to the 1.2 documentation, or is that frozen at this point?

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
+1 (non-binding)

Built and tested on Windows 7:

cd apache-spark
git fetch
git checkout v1.2.0-rc2
sbt assembly
[warn] ...
[warn]
[success] Total time: 720 s, completed Dec 11, 2014 8:57:36 AM

dir assembly\target\scala-2.10\spark-assembly-1.2.0-hadoop1.0.4.jar
110,361,054 spark-assembly-1.2.0-hadoop1.0.4.jar

Ran some of my 1.2 code successfully. Reviewed some docs; they look good. spark-shell.cmd works as expected.

Env details:

sbtconfig.txt:
-Xmx1024M
-XX:MaxPermSize=256m
-XX:ReservedCodeCacheSize=128m

sbt --version
sbt launcher version 0.13.1

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
RDD data flow
I was looking at some of the Partition implementations in core/rdd and at getOrCompute(...) in CacheManager. It appears that getOrCompute(...) returns an InterruptibleIterator, which delegates to a wrapped Iterator. That would imply that Partitions should extend Iterator, but that is not always the case. For example, the Partitions for these RDDs do not extend Iterator:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PartitionwiseSampledRDD.scala
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala

Why is that? Shouldn't all Partitions be Iterators? Clearly I'm missing something.

On a related subject, I was thinking of documenting the data flow of RDDs in more detail. The code is not hard to follow, but it's nice to have a simple picture with the major components and some explanation of the flow. The declaration of Partition is throwing me off.

Thanks!

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Re: RDD data flow
Patrick Wendell wrote:
> The Partition itself doesn't need to be an iterator - the iterator comes
> from the result of compute(partition). The Partition is just an identifier
> for that partition, not the data itself.

OK, that makes sense. The docs for Partition are a bit vague on this point. Maybe I'll add this to the docs.

Thanks Patrick!

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
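To make the distinction concrete, here is a bare-bones sketch of a custom RDD (the names are invented for illustration): the Partition subclass carries only an index and some bookkeeping, while compute() is what actually produces the Iterator over the data.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// The Partition is only an identifier plus whatever metadata compute() needs.
class RangePartition(override val index: Int, val start: Int, val end: Int)
  extends Partition

class SimpleRangeRDD(sc: SparkContext, max: Int, numParts: Int)
  extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    val step = math.max(1, max / numParts)
    (0 until numParts).map { i =>
      val start = i * step
      val end = if (i == numParts - 1) max else (i + 1) * step
      new RangePartition(i, start, end): Partition
    }.toArray
  }

  // The Iterator over the partition's data is created here, not in Partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}
```

Something like new SimpleRangeRDD(sc, 100, 4).collect() then returns 0 to 99, and the Partition objects themselves never hold any data.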
Re: Detecting configuration problems
Thanks Akhil!

I suspect the root cause of the shuffle OOM I was seeing (and probably many that users might see) is individual partitions on the reduce side not fitting in memory. As a guideline, I was thinking of something like "be sure that your largest partitions occupy no more than 1% of executor memory", or something to that effect. I can add that documentation to the tuning page if someone can suggest the best wording and numbers. I can also add a simple Spark shell example to estimate the largest partition size, to help determine executor memory and the number of partitions (a rough sketch is below).

One more question: I'm trying to get my head around the shuffle code. I see ShuffleManager, but that seems to be on the reduce side. Where is the code driving the map-side writes and reduce-side reads? I think it is possible to add up reduce-side volume for a key (they are byte reads at some point) and raise an alarm if it's getting too high. Even a warning on the console would be better than a catastrophic OOM.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
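Here is the kind of spark-shell snippet I had in mind (just a sketch; the sample data and the partitioner below are placeholders for your own job): count the records landing in each reduce-side partition to spot skew before it turns into an OOM.

```scala
import org.apache.spark.HashPartitioner

// pairs stands in for whatever keyed RDD you are about to shuffle.
val pairs = sc.parallelize(1 to 1000000, 8).map(i => (i % 10, i))

// Count how many records land in each partition for the chosen partitioner.
val recordsPerPartition = pairs
  .partitionBy(new HashPartitioner(8))
  .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
  .collect()

val largest = recordsPerPartition.maxBy(_._2)
println(s"partition ${largest._1} has ${largest._2} records")
```

Multiplying the largest count by a typical record size (measured on a small sample) gives a rough upper bound on partition size to compare against executor memory.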
Re: spark-shell 1.5 doesn't seem to work in local mode
Thanks guys. I do have HADOOP_INSTALL set, but Spark 1.4.1 did not seem to mind. It seems like there's a difference in behavior between 1.5.0 and 1.4.1 for some reason.

To the best of my knowledge, I just downloaded each tgz and untarred them in /opt. I adjusted my PATH to point to one or the other, but that should be about it.

Does 1.5.0 pick up HADOOP_INSTALL? Wouldn't spark-shell --master local override that? 1.5 seemed to completely ignore --master local.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
spark-shell 1.5 doesn't seem to work in local mode
059)
  at org.apache.spark.repl.Main$.main(Main.scala:31)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.ConnectException: Call From ltree1/127.0.0.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
  at org.apache.hadoop.ipc.Client.call(Client.java:1472)
  at org.apache.hadoop.ipc.Client.call(Client.java:1399)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
  at com.sun.proxy.$Proxy21.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
  at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1988)
  at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
  at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:596)
  at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
  ... 56 more
Caused by: java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
  at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
  at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
  at org.apache.hadoop.ipc.Client.call(Client.java:1438)
  ... 76 more

<console>:10: error: not found: value sqlContext
       import sqlContext.implicits._
              ^
<console>:10: error: not found: value sqlContext
       import sqlContext.sql
              ^

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
Help needed to publish SizeEstimator as separate library
Hi,

As I was going through the Spark source code, SizeEstimator (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala) caught my eye. It's a very useful tool for doing size estimation on the JVM, which helps in use cases like a memory-bounded cache. It would be useful to have this as a separate library that can be used in other projects too. There was a discussion about this long back (https://spark-project.atlassian.net/browse/SPARK-383), but I don't see any updates on it.

I have extracted the code and packaged it as a separate project on GitHub (https://github.com/phatak-dev/java-sizeof). I have simplified the code to remove the dependencies on google-guava and OpenHashSet, which leads to a small compromise in accuracy for big arrays but at the same time greatly simplifies the code base and dependency graph.

I want to publish it to Maven Central so it can be added as a dependency. Though I have published the code under my package com.madhu and kept the license information, I am not sure that is the right way to do it, so it would be great if someone could guide me on package naming and attribution.

--
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com
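For anyone who hasn't looked at it, the call site is about as simple as it gets. Here is a rough sketch of the kind of usage a standalone library would enable (note that inside Spark the object is, as far as I can tell, private[spark], which is part of the motivation for extracting it):

```scala
import org.apache.spark.util.SizeEstimator

// Approximate deep size, in bytes, of an arbitrary JVM object graph,
// e.g. the entries of a memory-bounded cache.
val cache = new java.util.HashMap[String, Array[Double]]()
cache.put("vector", Array.fill(1000)(1.0))

val approxBytes: Long = SizeEstimator.estimate(cache)
println(s"cache occupies roughly $approxBytes bytes")
```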
Re: Contributing Documentation Changes
Hi,

I understand that. The following page, http://spark.apache.org/documentation.html, has an "external tutorials, blogs" section which points to other blog pages. I wanted to add mine there.

Regards,
Madhukara Phatak
http://datamantra.io/

On Fri, Apr 24, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:
> I think that your own tutorials and such should live on your blog. The
> goal isn't to pull in a bunch of external docs to the site.
>
> On Fri, Apr 24, 2015 at 12:57 AM, madhu phatak <phatak@gmail.com> wrote:
>> Hi,
>> As I was reading the Contributing to Spark wiki, it was mentioned that we
>> can contribute external links to Spark tutorials. I have written many of
>> them on my blog (http://blog.madhukaraphatak.com/categories/spark/). It
>> would be great if someone could add them to the Spark website.
>>
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
Contributing Documentation Changes
Hi,

As I was reading the Contributing to Spark wiki, it was mentioned that we can contribute external links to Spark tutorials. I have written many of them on my blog (http://blog.madhukaraphatak.com/categories/spark/). It would be great if someone could add them to the Spark website.

Regards,
Madhukara Phatak
http://datamantra.io/
Review of ML PR
Hi,

I submitted a PR around 2 months back to improve the performance of decision trees by allowing a flexible, user-provided storage level for intermediate data. I have posted a few questions about handling backward compatibility, but there have been no answers for a long time. Can anybody help me move this forward? Here is the link to the PR:

https://github.com/apache/spark/pull/17972

--
Regards,
Madhukara Phatak
http://datamantra.io/
RandomForest caching
Hi,

I am testing RandomForestClassification with 50gb of data, which is cached in memory. I have 64gb of RAM, of which 28gb is used for caching the original dataset.

When I run random forest, it caches around 300GB of intermediate data, which uncaches the original dataset. This caching is triggered by the code below in RandomForest.scala:

```
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
```

As I don't have control over the storage level, I cannot make sure the original dataset stays in memory for other interactive tasks while random forest is running.

Is it a good idea to make this storage level a user parameter (a sketch of what I mean is below)? If so, I can open a JIRA issue and submit a PR for the same.

--
Regards,
Madhukara Phatak
http://datamantra.io/
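To make the proposal concrete, this is roughly what I have in mind. It is only a sketch, and the intermediateStorageLevel name below is hypothetical, not an existing Spark parameter:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical user-supplied setting (e.g. exposed through Strategy or the
// estimator's params), defaulting to today's behaviour of MEMORY_AND_DISK.
val intermediateStorageLevel: StorageLevel = StorageLevel.DISK_ONLY

val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(intermediateStorageLevel)
```

With DISK_ONLY (or a serialized level) for the intermediate data, the original cached dataset could stay in memory for other interactive work.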
Re: RandomForest caching
Hi,

I opened a JIRA: https://issues.apache.org/jira/browse/SPARK-20723

Can someone have a look?

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak <phatak@gmail.com> wrote:
> Hi,
>
> I am testing RandomForestClassification with 50gb of data which is cached
> in memory. I have 64gb of ram, in which 28gb is used for original dataset
> caching.
>
> When I run random forest, it caches around 300GB of intermediate data
> which un caches the original dataset. This caching is triggered by below
> code in RandomForest.scala
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
>     numTrees, withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ```
>
> As I don't have control over storage level, I cannot make sure original
> dataset stays in memory for other interactive tasks when random forest is
> running.
>
> Is it a good idea to make this storage level a user parameter? If so I can
> open a jira issue and give pr for the same.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/

--
Regards,
Madhukara Phatak
http://datamantra.io/
Time window on Processing Time
Hi,

As I am playing with Structured Streaming, I observed that the window function always requires a time column in the input data, so that means it's event time.

Is it possible to get the old Spark Streaming style window function based on processing time? I don't see any documentation on the same.

--
Regards,
Madhukara Phatak
http://datamantra.io/
Re: Time window on Processing Time
Hi,

That's great. Thanks a lot.

On Wed, Aug 30, 2017 at 10:44 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> Yes, it can be! There is a sql function called current_timestamp() which
> is self-explanatory. So I believe you should be able to do something like
>
> import org.apache.spark.sql.functions._
>
> ds.withColumn("processingTime", current_timestamp())
>   .groupBy(window("processingTime", "1 minute"))
>   .count()
>
> On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak <phatak@gmail.com> wrote:
>> Hi,
>> As I am playing with structured streaming, I observed that window
>> function always requires a time column in input data. So that means it's
>> event time.
>>
>> Is it possible to old spark streaming style window function based on
>> processing time. I don't see any documentation on the same.
>>
>> --
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/

--
Regards,
Madhukara Phatak
http://datamantra.io/
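For anyone finding this thread later, here is a self-contained sketch built on that suggestion; the socket source, host, and port are placeholders used only for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("processing-time-window")
  .getOrCreate()

// Any streaming source works; a socket source keeps the example small.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Stamp each row with the wall-clock time at which it is processed, then
// window on that column. This yields processing-time windows rather than
// event-time windows.
val counts = lines
  .withColumn("processingTime", current_timestamp())
  .groupBy(window(col("processingTime"), "1 minute"))
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```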