[Phpmyadmin-git] [phpmyadmin/localized_docs] 4c1de7: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 4c1de76c08153d1db8f3132a2c994a27e89a4701 https://github.com/phpmyadmin/localized_docs/commit/4c1de76c08153d1db8f3132a2c994a27e89a4701 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-02-17 (Tue, 17 Feb 2015) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1667 of 1667 strings) [CI skip]
[jira] [Commented] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell
[ https://issues.apache.org/jira/browse/SPARK-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323795#comment-14323795 ] Burak Yavuz commented on SPARK-5811: The documentation is not really blocked, but I want to test what I write in the documentation before submitting a PR, and while testing I ran into issue SPARK-5857. Documentation for --packages and --repositories on Spark Shell -- Key: SPARK-5811 URL: https://issues.apache.org/jira/browse/SPARK-5811 Project: Spark Issue Type: Documentation Components: Deploy, Spark Shell Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Critical Fix For: 1.3.0 Documentation for the new dependency-management support using Maven coordinates via --packages and --repositories.
[jira] [Created] (SPARK-5857) pyspark PYTHONPATH not properly set up?
Burak Yavuz created SPARK-5857: -- Summary: pyspark PYTHONPATH not properly set up? Key: SPARK-5857 URL: https://issues.apache.org/jira/browse/SPARK-5857 Project: Spark Issue Type: Bug Components: Deploy, PySpark Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Blocker

Locally, I run the following command:

```
bin/pyspark --py-files ~/Projects/spark-csv/python/thunder/clustering/kmeans.py
```

Normally kmeans.py should then be on the PYTHONPATH, but when I check:

```
>>> import os
>>> os.environ['PYTHONPATH']
'~/Documents/spark/python/lib/py4j-0.8.2.1-src.zip:~/Documents/spark/python/:'
```

it's not there.
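A minimal sketch of the check from inside that pyspark shell (an illustration, not part of the ticket; note that imports consult sys.path, not only the PYTHONPATH environment variable):

```python
# Hypothetical check inside a pyspark shell started with --py-files.
import os
import sys

print(os.environ.get('PYTHONPATH', ''))        # what was exported to the environment
print([p for p in sys.path if 'kmeans' in p])  # what the interpreter will actually search
try:
    import kmeans                              # should succeed if --py-files took effect
    print('kmeans importable from', kmeans.__file__)
except ImportError as e:
    print('kmeans not importable:', e)
```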
[jira] [Commented] (SPARK-5810) Maven Coordinate Inclusion failing in pySpark
[ https://issues.apache.org/jira/browse/SPARK-5810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323474#comment-14323474 ] Burak Yavuz commented on SPARK-5810: Makes sense to add a regression test. I'll add it with the documentation PR which I'll submit today. I'll ping you on that one so that you can take a look. Maven Coordinate Inclusion failing in pySpark - Key: SPARK-5810 URL: https://issues.apache.org/jira/browse/SPARK-5810 Project: Spark Issue Type: Bug Components: Deploy, PySpark Affects Versions: 1.3.0 Reporter: Burak Yavuz Assignee: Josh Rosen Priority: Blocker Fix For: 1.3.0 When including maven coordinates to download dependencies in pyspark, pyspark returns a GatewayError, because it cannot read the proper port to communicate with the JVM. This is because pyspark relies on STDIN to read the port number and in the meantime Ivy prints out a whole lot of logs.
[jira] [Created] (SPARK-5810) Maven Coordinate Inclusion failing in pySpark
Burak Yavuz created SPARK-5810: -- Summary: Maven Coordinate Inclusion failing in pySpark Key: SPARK-5810 URL: https://issues.apache.org/jira/browse/SPARK-5810 Project: Spark Issue Type: Bug Components: Deploy, PySpark Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Blocker Fix For: 1.3.0 When including maven coordinates to download dependencies in pyspark, pyspark returns a GatewayError, because it cannot read the proper port to communicate with the JVM. This is because pyspark relies on STDIN to read the port number and in the meantime Ivy prints out a whole lot of logs.
[jira] [Created] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell
Burak Yavuz created SPARK-5811: -- Summary: Documentation for --packages and --repositories on Spark Shell Key: SPARK-5811 URL: https://issues.apache.org/jira/browse/SPARK-5811 Project: Spark Issue Type: Documentation Components: Deploy, Spark Shell Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Critical Fix For: 1.3.0 Documentation for the new dependency-management support using Maven coordinates via --packages and --repositories.
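For context, the kind of usage such documentation would cover looks roughly like the sketch below; the coordinate, repository URL, and script name are placeholders, not examples taken from the ticket:

```
# Pull a package (and its transitive dependencies) from Maven Central into the shell
bin/spark-shell --packages com.example:example-lib_2.10:1.0.0

# Same idea for spark-submit, with an additional resolver
bin/spark-submit \
  --packages com.example:example-lib_2.10:1.0.0 \
  --repositories https://repo.example.com/maven \
  my_app.py
```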
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 633212: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 63321222cd0e528555a6353d2d4e937216ef391c https://github.com/phpmyadmin/phpmyadmin/commit/63321222cd0e528555a6353d2d4e937216ef391c Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-02-12 (Thu, 12 Feb 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (3012 of 3012 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 96b710: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 96b710be3d018132ab8c2cf9501ccb31d6ad2e68 https://github.com/phpmyadmin/phpmyadmin/commit/96b710be3d018132ab8c2cf9501ccb31d6ad2e68 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-02-12 (Thu, 12 Feb 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (3012 of 3012 strings) [CI skip]
Re: generate a random matrix with uniform distribution
Sorry about that, yes, it should be uniformVectorRDD. Thanks Sean!

Burak

On Mon, Feb 9, 2015 at 2:05 AM, Sean Owen so...@cloudera.com wrote:
Yes, the example given here should have used uniformVectorRDD. Then it's correct.

On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini lucapug...@gmail.com wrote:
Thanks a lot! Can I ask why this code generates a uniform distribution? If dist is N(0,1), data should be N(-1, 2). Let me know. Thanks, Luca

2015-02-07 3:00 GMT+00:00 Burak Yavuz brk...@gmail.com:
Hi, You can do the following:
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._

// sc is the spark context, numPartitions is the number of partitions you want the RDD to be in
val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions, seed)
// make the distribution uniform between (-1, 1)
val data = dist.map(_ * 2 - 1)
val matrix = new RowMatrix(data, n, k)
```

On Feb 6, 2015 11:18 AM, Donbeo lucapug...@gmail.com wrote:
Hi, I would like to know how I can generate a random matrix where each element comes from a uniform distribution in [-1, 1]. In particular I would like the matrix to be a distributed row matrix with dimension n x p. Is this possible with mllib? Should I use another library?
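For reference, a corrected version along the lines Sean suggests might look like the sketch below (an illustration, not from the thread; it assumes the same sc, n, k, numPartitions and seed as above, and spells out the element-wise rescaling since Vector does not support arithmetic operators directly):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs

// uniformVectorRDD gives i.i.d. Uniform(0, 1) entries
val dist = RandomRDDs.uniformVectorRDD(sc, n, k, numPartitions, seed)
// shift and scale each entry to Uniform(-1, 1)
val data = dist.map(v => Vectors.dense(v.toArray.map(x => 2 * x - 1)))
val matrix = new RowMatrix(data, n, k)
```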
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 09539d: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 09539d41e6ce4eca62ff02b0ecd47bcbfe3c2fee https://github.com/phpmyadmin/phpmyadmin/commit/09539d41e6ce4eca62ff02b0ecd47bcbfe3c2fee Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-02-09 (Mon, 09 Feb 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (3011 of 3011 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] d53e06: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: d53e064ee7a01f8768e7453d0eef73b2921b44be https://github.com/phpmyadmin/phpmyadmin/commit/d53e064ee7a01f8768e7453d0eef73b2921b44be Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-02-08 (Sun, 08 Feb 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (3009 of 3009 strings) [CI skip]
Re: matrix of random variables with spark.
Forgot to add the more recent training material: https://databricks-training.s3.amazonaws.com/index.html

On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz brk...@gmail.com wrote:
Hi Luca, You can tackle this using RowMatrix (spark-shell example):
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._

// sc is the spark context, numPartitions is the number of partitions you want the RDD to be in
val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions, seed)
val matrix = new RowMatrix(data, n, k)
```
You can find more tutorials here: https://spark-summit.org/2013/exercises/index.html Best, Burak

On Fri, Feb 6, 2015 at 10:03 AM, Luca Puggini lucapug...@gmail.com wrote:
Hi all, this is my first email with this mailing list and I hope that I am not doing anything wrong. I am currently trying to define a distributed matrix with n rows and k columns where each element is randomly sampled from a uniform distribution. How can I do that? It would also be nice if you could suggest a good guide that I can use to start working with Spark. (The quick start tutorial is not enough for me.) Thanks a lot!
Re: matrix of random variables with spark.
Hi Luca, You can tackle this using RowMatrix (spark-shell example):
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._

// sc is the spark context, numPartitions is the number of partitions you want the RDD to be in
val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions, seed)
val matrix = new RowMatrix(data, n, k)
```
You can find more tutorials here: https://spark-summit.org/2013/exercises/index.html Best, Burak

On Fri, Feb 6, 2015 at 10:03 AM, Luca Puggini lucapug...@gmail.com wrote:
Hi all, this is my first email with this mailing list and I hope that I am not doing anything wrong. I am currently trying to define a distributed matrix with n rows and k columns where each element is randomly sampled from a uniform distribution. How can I do that? It would also be nice if you could suggest a good guide that I can use to start working with Spark. (The quick start tutorial is not enough for me.) Thanks a lot!
Re: generate a random matrix with uniform distribution
Hi, You can do the following:
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._

// sc is the spark context, numPartitions is the number of partitions you want the RDD to be in
val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions, seed)
// make the distribution uniform between (-1, 1)
val data = dist.map(_ * 2 - 1)
val matrix = new RowMatrix(data, n, k)
```

On Feb 6, 2015 11:18 AM, Donbeo lucapug...@gmail.com wrote:
Hi, I would like to know how I can generate a random matrix where each element comes from a uniform distribution in [-1, 1]. In particular I would like the matrix to be a distributed row matrix with dimension n x p. Is this possible with mllib? Should I use another library?
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 13d1c0: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 13d1c0dacda739d0c6af60097be3788f01ca2964 https://github.com/phpmyadmin/phpmyadmin/commit/13d1c0dacda739d0c6af60097be3788f01ca2964 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-02-04 (Wed, 04 Feb 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (3009 of 3009 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] b4d7a5: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: b4d7a519fa2825bf91611d98c3112679b1b5cba9 https://github.com/phpmyadmin/phpmyadmin/commit/b4d7a519fa2825bf91611d98c3112679b1b5cba9 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-28 (Wed, 28 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (3005 of 3005 strings) [CI skip]
[jira] [Created] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit
Burak Yavuz created SPARK-5341: -- Summary: Support maven coordinates in spark-shell and spark-submit Key: SPARK-5341 URL: https://issues.apache.org/jira/browse/SPARK-5341 Project: Spark Issue Type: New Feature Components: Deploy, Spark Shell Reporter: Burak Yavuz This feature will allow users to provide the maven coordinates of jars they wish to use in their spark application. Coordinates can be a comma-delimited list and be supplied like: ```spark-submit --maven org.apache.example.a,org.apache.example.b``` This feature will also be added to spark-shell (where it is more critical to have this feature)
[jira] [Created] (SPARK-5322) Add transpose() to BlockMatrix
Burak Yavuz created SPARK-5322: -- Summary: Add transpose() to BlockMatrix Key: SPARK-5322 URL: https://issues.apache.org/jira/browse/SPARK-5322 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Once Local matrices have the option to transpose, transposing a BlockMatrix will be trivial. Again, this will be a flag, which will in the end affect every SubMatrix in the RDD.
[jira] [Created] (SPARK-5321) Add transpose() method to Matrix
Burak Yavuz created SPARK-5321: -- Summary: Add transpose() method to Matrix Key: SPARK-5321 URL: https://issues.apache.org/jira/browse/SPARK-5321 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz While we are working on BlockMatrix, it will be nice to add the support to transpose matrices. .transpose() will just modify a private flag in local matrices. Operations that follow will be performed based on this flag.
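As a rough illustration of the flag-based approach described in the ticket (an editor's sketch, not the MLlib implementation): transposing only swaps the dimensions and flips a flag, and the indexing logic reads the same column-major buffer the other way around.

```scala
// Illustrative sketch of transpose-as-a-flag over a column-major value buffer.
class DenseMatrixSketch(
    val numRows: Int,
    val numCols: Int,
    val values: Array[Double],
    val isTransposed: Boolean = false) {

  // (i, j) lookup: read column-major directly, or read the same buffer "sideways"
  def apply(i: Int, j: Int): Double =
    if (!isTransposed) values(i + j * numRows)
    else values(j + i * numCols)

  // No data is copied; only the dimensions and the flag change
  def transpose: DenseMatrixSketch =
    new DenseMatrixSketch(numCols, numRows, values, !isTransposed)
}
```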
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 22e00e: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 22e00e6a3578de1aede0ce06ef9e327c4bbe3f28 https://github.com/phpmyadmin/phpmyadmin/commit/22e00e6a3578de1aede0ce06ef9e327c4bbe3f28 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-09 (Fri, 09 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2996 of 2996 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 6f8431: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 6f8431a71d935b9710d8f5148b3941f21408052d https://github.com/phpmyadmin/phpmyadmin/commit/6f8431a71d935b9710d8f5148b3941f21408052d Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-08 (Thu, 08 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2995 of 2995 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 2eddd0: Translated using Weblate (Turkish)
Branch: refs/heads/QA_4_3 Home: https://github.com/phpmyadmin/phpmyadmin Commit: 2eddd0dc06e3f5ce3899fd2436b6b5541fcbcbfc https://github.com/phpmyadmin/phpmyadmin/commit/2eddd0dc06e3f5ce3899fd2436b6b5541fcbcbfc Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-01 (Thu, 01 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2982 of 2982 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 530c04: Translated using Weblate (Turkish)
Branch: refs/heads/QA_4_3 Home: https://github.com/phpmyadmin/phpmyadmin Commit: 530c04d14a9de6ba9b287b2a98306a09d04ee055 https://github.com/phpmyadmin/phpmyadmin/commit/530c04d14a9de6ba9b287b2a98306a09d04ee055 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-01 (Thu, 01 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2982 of 2982 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] f492a2: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: f492a2197d598a1618836719a47beaf16874ecfd https://github.com/phpmyadmin/phpmyadmin/commit/f492a2197d598a1618836719a47beaf16874ecfd Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-01 (Thu, 01 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2993 of 2993 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] d26bff: Translated using Weblate (Turkish)
Branch: refs/heads/QA_4_3 Home: https://github.com/phpmyadmin/phpmyadmin Commit: d26bffd0ae44354c4f47e6852368c48166e1ab1f https://github.com/phpmyadmin/phpmyadmin/commit/d26bffd0ae44354c4f47e6852368c48166e1ab1f Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2015-01-01 (Thu, 01 Jan 2015) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2982 of 2982 strings) [CI skip]
Re: null Error in ALS model predict
Hi, The MatrixFactorizationModel consists of two RDDs. When you use the second method, Spark tries to serialize both RDDs for the .map() function, which is not possible, because RDDs are not serializable. Therefore you receive the NullPointerException. You must use the first method. Best, Burak - Original Message - From: Franco Barrientos franco.barrien...@exalitica.com To: user@spark.apache.org Sent: Wednesday, December 24, 2014 7:44:24 AM Subject: null Error in ALS model predict Hi all! I have a RDD[(int,int,double,double)] where the first two int values are id and product, respectively. I trained an implicit ALS algorithm and want to make predictions from this RDD. I tried two things, which I think should be equivalent. 1- Convert this RDD to RDD[(int,int)] and use model.predict(RDD(int,int)); this works for me! 2- Make a map and apply model.predict(int,int), for example: val ratings = RDD[(int,int,double,double)].map{ case (id, rubro, rating, resp) => model.predict(id,rubro) } where ratings is a RDD[Double]. Now, with the second way, when I call ratings.first() I get the following error. Why does this happen? I need to use this second way. Thanks in advance, Franco Barrientos Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 mailto:franco.barrien...@exalitica.com www.exalitica.com
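For completeness, a sketch of the first approach (an illustration, not from the thread; `data` stands for the RDD[(Int, Int, Double, Double)] from the question and `model` for the trained MatrixFactorizationModel):

```scala
// Build the (user, product) pairs once, then let the model predict over the whole RDD,
// instead of calling model.predict inside a map() on the driver-side model.
val usersProducts = data.map { case (id, rubro, rating, resp) => (id, rubro) }

// predict(RDD[(Int, Int)]) returns an RDD[Rating] with the predicted scores
val predictions = model.predict(usersProducts)
predictions.take(5).foreach(println)
```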
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 0e0eda: Translated using Weblate (Turkish)
Branch: refs/heads/QA_4_3 Home: https://github.com/phpmyadmin/phpmyadmin Commit: 0e0eda5ff1f54eb07b26e9c46db734ff1eee966c https://github.com/phpmyadmin/phpmyadmin/commit/0e0eda5ff1f54eb07b26e9c46db734ff1eee966c Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-12-16 (Tue, 16 Dec 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2978 of 2978 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/localized_docs] 3e6f0e: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 3e6f0edfc6e9be3c8cd45c4cb82b8d39afe8c9e6 https://github.com/phpmyadmin/localized_docs/commit/3e6f0edfc6e9be3c8cd45c4cb82b8d39afe8c9e6 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-12-09 (Tue, 09 Dec 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1657 of 1657 strings) [CI skip]
Re: How can I make Spark Streaming count the words in a file in a unit test?
Hi, https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf contains some performance tests for streaming. There are examples of how to generate synthetic files during the test in that repo; maybe you can find some code snippets there that you can use. Best, Burak

- Original Message - From: Emre Sevinc emre.sev...@gmail.com To: user@spark.apache.org Sent: Monday, December 8, 2014 2:36:41 AM Subject: How can I make Spark Streaming count the words in a file in a unit test?

Hello, I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala at https://github.com/apache/spark/blob/branch-1.1/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala . When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C. Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the number of words. What am I missing? Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:

===== StarterAppTest.java =====
```
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.junit.*;

import java.io.*;

public class StarterAppTest {
  JavaStreamingContext ssc;
  File tempDir;

  @Before
  public void setUp() {
    ssc = new JavaStreamingContext("local", "test", new Duration(3000));
    tempDir = Files.createTempDir();
    tempDir.deleteOnExit();
  }

  @After
  public void tearDown() {
    ssc.stop();
    ssc = null;
  }

  @Test
  public void testInitialization() {
    Assert.assertNotNull(ssc.sc());
  }

  @Test
  public void testCountWords() {
    StarterApp starterApp = new StarterApp();

    try {
      JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
      JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);

      System.err.println("===== Word Counts =====");
      wordCounts.print();
      System.err.println("===== Word Counts =====");

      ssc.start();

      File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
      PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
      writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
      writer.close();

      System.err.println("===== Word Counts =====");
      wordCounts.print();
      System.err.println("===== Word Counts =====");
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
      e.printStackTrace();
    }

    Assert.assertTrue(true);
  }
}
```
===============================

This test compiles and starts to run, and Spark Streaming prints a lot of diagnostic messages on the console, but the calls to wordCounts.print() do not print anything, whereas in StarterApp.java itself they do. I've also added ssc.awaitTermination(); after ssc.start(), but nothing changed in that respect. After that I've also tried to create a new file in the directory that this Spark Streaming application was checking, but this time it gave an error.
For completeness, below is the countWords method:

```
public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
  JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String x) {
      return Lists.newArrayList(SPACE.split(x));
    }
  });

  JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2(s, 1);
      }
    }).reduceByKey((i1, i2) -> i1 + i2);

  return wordCounts;
}
```

Kind regards, Emre Sevinç
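One untested adjustment worth trying (an editor's sketch, not from the thread): after ssc.start() and writing the file, give the 3-second batch interval time to fire before the test method returns, for example:

```
// Inside testCountWords(), after writing tmp.txt:
// Crude, but lets at least one batch run before the @After hook stops the context;
// the spark-perf utilities linked above show more robust ways to wait for output.
Thread.sleep(10000);
```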
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] eed0ff: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: eed0ffa96b6ee739036175912c32fca25985bead https://github.com/phpmyadmin/phpmyadmin/commit/eed0ffa96b6ee739036175912c32fca25985bead Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-11-26 (Wed, 26 Nov 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2991 of 2991 strings) [CI skip]
[jira] [Created] (SPARK-4409) Additional (but limited) Linear Algebra Utils
Burak Yavuz created SPARK-4409: -- Summary: Additional (but limited) Linear Algebra Utils Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor
[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-4409: --- Description:

This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development of algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD and Multi Model Training (SPARK-1486). The proposed methods for addition are:

For `Matrix`:
- map: maps the values in the matrix with a given function. Produces a new matrix.
- update: the values in the matrix are updated with a given function. Occurs in place.

Factory methods for `DenseMatrix`:
- *zeros: Generate a matrix consisting of zeros
- *ones: Generate a matrix consisting of ones
- *eye: Generate an identity matrix
- *rand: Generate a matrix consisting of i.i.d. uniform random numbers
- *randn: Generate a matrix consisting of i.i.d. gaussian random numbers
- *diag: Generate a diagonal matrix from a supplied vector

*These methods already exist as factory methods for `Matrices`; however, for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix`, where the output will be a `DenseMatrix`, and having the factory methods for `Matrices` call these functions directly and output a generic `Matrix`.

Factory methods for `SparseMatrix`:
- speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar.
- sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers.
- sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers.
- diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training.

Factory methods for `Matrices`:
- Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`.
- horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training and for the repartitioning of BlockMatrix.
- vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.

Additional (but limited) Linear Algebra Utils - Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor
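For a sense of how the proposed factory methods might be used once added (an editor's sketch; the signatures are assumptions based on the names in the ticket, not a final API):

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Matrix, SparseMatrix}

val z = DenseMatrix.zeros(3, 4)                            // 3 x 4 dense matrix of zeros
val r = DenseMatrix.rand(3, 4, new java.util.Random(42))   // i.i.d. Uniform(0, 1) entries
val i = SparseMatrix.speye(1000)                           // sparse 1000 x 1000 identity
val wide = Matrices.horzCat(Array[Matrix](z, r))           // 3 x 8 horizontal concatenation
```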
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 346b62: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 346b62740ab25f5d325f4aa74aeadd8aad7236c4 https://github.com/phpmyadmin/phpmyadmin/commit/346b62740ab25f5d325f4aa74aeadd8aad7236c4 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-11-04 (Tue, 04 Nov 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2976 of 2976 strings) [CI skip]
[jira] [Commented] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192731#comment-14192731 ] Burak Yavuz commented on SPARK-3974: Hi everyone, The design doc for Block Matrix abstractions and the work on matrix multiplication can be found here: goo.gl/zbU1Nz Let me know if you have any comments / suggestions. I will have the PR for this ready by next Friday hopefully. Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Burak Yavuz We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns required.
Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080. Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may help in pinpointing the bug. Thanks, Burak

- Original Message - From: Ilya Ganelin ilgan...@gmail.com To: user user@spark.apache.org Sent: Monday, October 27, 2014 11:36:46 AM Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

Hello all - I am attempting to run MLlib's ALS algorithm on a substantial test vector - approx. 200 million records. I have resolved a few issues I've had with regards to garbage collection, KryoSerialization, and memory usage. I have not been able to get around this issue I see below however:

java.lang.ArrayIndexOutOfBoundsException: 6106
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

I do not have any negative indices or indices that exceed Int-Max. I have partitioned the input data into 300 partitions and my Spark config is below:

.set("spark.executor.memory", "14g")
.set("spark.storage.memoryFraction", "0.8")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "MyRegistrator")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.frameSize", "50")
.set("spark.yarn.executor.memoryOverhead", "1024")

Does anyone have any suggestions as to why I'm seeing the above error or how to get around it? It may be possible to upgrade to the latest version of Spark, but the mechanism for doing so in our environment isn't obvious yet. -Ilya Ganelin
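A quick way to gather the id-range information asked about above (an editor's sketch, not from the thread; `ratings` stands for the RDD[Rating] fed to ALS):

```scala
// The thread already rules out negative ids, but checking is cheap:
// report negative ids and the overall user/product id range before training.
val negatives = ratings.filter(r => r.user < 0 || r.product < 0).count()
val maxUser = ratings.map(_.user).max()
val maxProduct = ratings.map(_.product).max()
println(s"negative ids: $negatives, max user id: $maxUser, max product id: $maxProduct")
```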
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 7180bb: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 7180bb0f150e81dc6ceb0ff1e582bd85fdb69306 https://github.com/phpmyadmin/phpmyadmin/commit/7180bb0f150e81dc6ceb0ff1e582bd85fdb69306 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-10-16 (Thu, 16 Oct 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2973 of 2973 strings) [CI skip]
Re: Spark KMeans hangs at reduceByKey / collectAsMap
Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot of calculation going on. Did you use a different value for the number of runs? If you look at the storage tab, does the data look balanced among executors? Best, Burak - Original Message - From: Ray ray-w...@outlook.com To: u...@spark.incubator.apache.org Sent: Tuesday, October 14, 2014 2:58:03 PM Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can see the application got 201 vCores. From the spark UI, I can see it got 201 executors (as shown below). http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_core.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_executor.png Thanks. Ray
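As a concrete starting point (an editor's sketch, not from the thread; `data` stands for the RDD of 1.5 million sparse vectors and `k` for the chosen cluster count), repartitioning and caching the input and keeping runs at 1 while tuning usually makes this stage behave:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Spread the sparse vectors evenly over the executors and keep them in memory
val vectors = data.repartition(200).cache()
vectors.count()  // force materialization so the first training iteration is not skewed by loading

// KMeans.train(data, k, maxIterations, runs): keep runs = 1 while tuning,
// since every extra run multiplies the work done per iteration
val model = KMeans.train(vectors, k, 20, 1)
```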
[Phpmyadmin-git] [phpmyadmin/localized_docs] ac28de: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: ac28dedf064d6f2064afb17d3311a929edd95dad https://github.com/phpmyadmin/localized_docs/commit/ac28dedf064d6f2064afb17d3311a929edd95dad Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-10-12 (Sun, 12 Oct 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1656 of 1656 strings) [CI skip]
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167152#comment-14167152 ] Burak Yavuz commented on SPARK-3434: [~ConcreteVitamin], any updates? Anything I can help out with? Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix?
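On the partitioning question, a rough sketch of a grid partitioner over (blockRow, blockCol) keys (an editor's illustration of the idea under discussion, not Spark's eventual implementation):

```scala
import org.apache.spark.Partitioner

// Keys are (blockRowIndex, blockColIndex) pairs identifying dense sub-matrix blocks.
class GridPartitioner(val numRowBlocks: Int, val numColBlocks: Int,
                      override val numPartitions: Int) extends Partitioner {

  override def getPartition(key: Any): Int = key match {
    // Flatten the 2-D block coordinate, then spread it across the partitions
    case (i: Int, j: Int) => ((i.toLong * numColBlocks + j) % numPartitions).toInt
    case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
  }
}
```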
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 2079e9: Translated using Weblate (Turkish)
Branch: refs/heads/QA_4_2 Home: https://github.com/phpmyadmin/phpmyadmin Commit: 2079e9cd9abf4d76e50494ce4bf8f7c1d4999164 https://github.com/phpmyadmin/phpmyadmin/commit/2079e9cd9abf4d76e50494ce4bf8f7c1d4999164 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-10-06 (Mon, 06 Oct 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2768 of 2768 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/localized_docs] 38df14: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 38df143ca748c7a5236c70cb0c715ea948195184 https://github.com/phpmyadmin/localized_docs/commit/38df143ca748c7a5236c70cb0c715ea948195184 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-10-02 (Thu, 02 Oct 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1648 of 1648) [ci skip]
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 5288df: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 5288df43097df61237fe4d9320a56b0886ed11db https://github.com/phpmyadmin/phpmyadmin/commit/5288df43097df61237fe4d9320a56b0886ed11db Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-10-02 (Thu, 02 Oct 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2965 of 2965 strings) [CI skip]
[Phpmyadmin-git] [phpmyadmin/localized_docs] 1169c4: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 1169c49661f124d4d617d1316d62404d598d30bf https://github.com/phpmyadmin/localized_docs/commit/1169c49661f124d4d617d1316d62404d598d30bf Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-10-02 (Thu, 02 Oct 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1657 of 1657 strings) [CI skip]
Re: MLlib Linear Regression Mismatch
Hi, It appears that the step size is so high that the model is diverging with the added noise. Could you try by setting the step size to be 0.1 or 0.01? Best, Burak - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subject: MLlib Linear Regression Mismatch Guys, Obviously I am doing something wrong. Maybe 4 points are too small a dataset. Can you help me figure out why the following doesn't work? a) This works: data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) print lrm print lrm.weights print lrm.intercept lrm.predict([40]) output: pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50 [ 1.] 0.0 40.0 b) By perturbing the y a little bit, the model gives wrong results: data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(9.0, [10.0]), LabeledPoint(22.0, [20.0]), LabeledPoint(32.0, [30.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) # should be 1.09x -0.60 print lrm print lrm.weights print lrm.intercept lrm.predict([40]) Output: pyspark.mllib.regression.LinearRegressionModel object at 0x109666590 [ -8.20487463e+203] 0.0 -3.2819498532740317e+205 c) Same story here - wrong results. Actually nan: data = [ LabeledPoint(18.9, [3910.0]), LabeledPoint(17.0, [3860.0]), LabeledPoint(20.0, [4200.0]), LabeledPoint(16.6, [3660.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170 print lrm print lrm.weights print lrm.intercept lrm.predict([4000]) Output: pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90 [ nan] 0.0 nan Cheers Thanks k/
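For instance, case (b) above re-run with a smaller step size and more iterations (an editor's sketch building on the snippet in the question; feature scaling would additionally help case (c), where the feature values are in the thousands):

```python
# Same LabeledPoint data as case (b); only the optimizer settings change.
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
                                    iterations=1000,
                                    step=0.01,
                                    initialWeights=array([1.0]))
print lrm.weights
print lrm.intercept
print lrm.predict([40])
```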
[Phpmyadmin-git] [phpmyadmin/localized_docs] 1c004d: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 1c004d7e341e8e0d4b5c17dcdc64181220725193 https://github.com/phpmyadmin/localized_docs/commit/1c004d7e341e8e0d4b5c17dcdc64181220725193 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-09-22 (Mon, 22 Sep 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1648 of 1648) [ci skip]
[jira] [Commented] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143484#comment-14143484 ] Burak Yavuz commented on SPARK-3631: Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen] could you please confirm/correct/add to my explanation above. Thanks! Add docs for checkpoint usage - Key: SPARK-3631 URL: https://issues.apache.org/jira/browse/SPARK-3631 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Andrew Ash We should include general documentation on using checkpoints. Right now the docs only cover checkpoints in the Spark Streaming use case which is slightly different from Core. Some content to consider for inclusion from [~brkyvz]: {quote} If you set the checkpointing directory however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick off from there. You won't need to keep the shuffle data before the checkpointed state, therefore those can be safely removed (will be removed automatically). However, checkpoint must be called explicitly as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 , just setting the directory will not be enough. {quote} {quote} Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in and having to start from scratch. The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes come as a bonus to checkpointing, it is not the main goal. The subtlety here is that .checkpoint() is just like .cache(). Until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a row and you don't want to checkpoint in the meantime until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long. I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS, and resets all its dependencies, so you start a fresh lineage. My understanding would be that checkpointing still should be done every N operations to reset the lineage. However, an action must be performed before the lineage grows too long. {quote} A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html
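A short sketch of the core-Spark pattern the quotes describe (an editor's illustration; the checkpoint directory is a placeholder): set the directory, call checkpoint() explicitly, and run an action before the lineage grows too long.

```scala
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder path

var rdd = sc.parallelize(1 to 1000000)
for (i <- 1 to 1000) {
  rdd = rdd.map(_ + 1)
  if (i % 100 == 0) {
    rdd.checkpoint()  // marks the RDD; the lineage is truncated once it is materialized
    rdd.count()       // an action is required, otherwise nothing is written to HDFS
  }
}
```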
Re: Python version of kmeans
Hi, spark-1.0.1/examples/src/main/python/kmeans.py = Naive example for users to understand how to code in Spark spark-1.0.1/python/pyspark/mllib/clustering.py = Use this!!! Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py = Example on how to call KMeans. Feel free to use it as a template! Best, Burak - Original Message - From: MEETHU MATHEW meethu2...@yahoo.co.in To: user@spark.apache.org Sent: Wednesday, September 17, 2014 10:26:40 PM Subject: Python version of kmeans Hi all, I need the kmeans code written against Pyspark for some testing purpose. Can somebody tell me the difference between these two files. spark-1.0.1/examples/src/main/python/kmeans.py and spark-1.0.1/python/pyspark/mllib/clustering.py Thanks Regards, Meethu M - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Odd error when using a rdd map within a stream map
Hi, I believe it's because you're trying to use a Function of an RDD, in an RDD, which is not possible. Instead of using a `Function<JavaRDD<Float>>`, could you try `Function<Float>`, and `public Void call(Float arg0) throws Exception { ` and `System.out.println(arg0)` instead. I'm not perfectly sure of the semantics in Java, but this should be what you're actually trying to do. Best, Burak - Original Message - From: Filip Andrei andreis.fi...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, September 18, 2014 6:57:21 AM Subject: Odd error when using a rdd map within a stream map here i wrote a simpler version of the code to get an understanding of how it works: final List<NeuralNet> nns = new ArrayList<NeuralNet>(); for(int i = 0; i < numberOfNets; i++){ nns.add(NeuralNet.createFrom(...)); } final JavaRDD<NeuralNet> nnRdd = sc.parallelize(nns); JavaDStream<Float> results = rndLists.flatMap(new FlatMapFunction<Map<String,Object>, Float>() { @Override public Iterable<Float> call(Map<String, Object> input) throws Exception { Float f = nnRdd.map(new Function<NeuralNet, Float>() { @Override public Float call(NeuralNet nn) throws Exception { return 1.0f; } }).reduce(new Function2<Float, Float, Float>() { @Override public Float call(Float left, Float right) throws Exception { return left + right; } }); return Arrays.asList(f); } }); results.print(); This works as expected and print() simply shows the number of neural nets i have If instead a print() i use results.foreach(new Function<JavaRDD<Float>, Void>() { @Override public Void call(JavaRDD<Float> arg0) throws Exception { for(Float f : arg0.collect()){ System.out.println(f); } return null; } }); It fails with the following exception org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.NullPointerException org.apache.spark.rdd.RDD.map(RDD.scala:270) This is weird to me since the same code executes as expected in one case and doesn't in the other, any idea what's going on here ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Odd-error-when-using-a-rdd-map-within-a-stream-map-tp14551.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark on EC2
Hi Gilberto, Could you please attach the driver logs as well, so that we can pinpoint what's going wrong? Could you also add the flag `--driver-memory 4g` while submitting your application and try that as well? Best, Burak - Original Message - From: Gilberto Lira g...@scanboo.com.br To: user@spark.apache.org Sent: Thursday, September 18, 2014 11:48:03 AM Subject: Spark on EC2 Hello, I am trying to run a python script that makes use of the kmeans MLIB and I'm not getting anywhere. I'm using an c3.xlarge instance as master, and 10 c3.large instances as slaves. In the code I make a map of a 600MB csv file in S3, where each row has 128 integer columns. The problem is that around the TID7 my slave stops responding, and I can not finish my processing. Could you help me with this problem? I sending my script attached for review. Thank you, Gilberto - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[Phpmyadmin-git] [phpmyadmin/localized_docs] a01814: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: a01814147d950fa4fa4a4a9006a7c5690a9701b6 https://github.com/phpmyadmin/localized_docs/commit/a01814147d950fa4fa4a4a9006a7c5690a9701b6 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-09-17 (Wed, 17 Sep 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1653 of 1653) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
Re: Spark and disk usage.
Hi, The files you mentioned are temporary files written by Spark during shuffling. ALS will write a LOT of those files as it is a shuffle heavy algorithm. Those files will be deleted after your program completes as Spark looks for those files in case a fault occurs. Having those files ready allows Spark to continue from the stage the shuffle left off, instead of starting from the very beginning. Long story short, it's to your benefit that Spark writes those files to disk. If you don't want Spark writing to disk, you can specify a checkpoint directory in HDFS, where Spark will write the current status instead and will clean up files from disk. Best, Burak - Original Message - From: Макар Красноперов connector@gmail.com To: user@spark.apache.org Sent: Wednesday, September 17, 2014 7:37:49 AM Subject: Spark and disk usage. Hello everyone. The problem is that spark write data to the disk very hard, even if application has a lot of free memory (about 3.8g). So, I've noticed that folder with name like spark-local-20140917165839-f58c contains a lot of other folders with files like shuffle_446_0_1. The total size of files in the dir spark-local-20140917165839-f58c can reach 1.1g. Sometimes its size decreases (are there only temp files in that folder?), so the totally amount of data written to the disk is greater than 1.1g. The question is what kind of data Spark store there and can I make spark not to write it on the disk and just keep it in the memory if there is enough RAM free space? I run my job locally with Spark 1.0.1: ./bin/spark-submit --driver-memory 12g --master local[3] --properties-file conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar spark-defaults.conf : spark.shuffle.spill false spark.reducer.maxMbInFlight 1024 spark.shuffle.file.buffer.kb2048 spark.storage.memoryFraction0.7 The situation with disk usage is common for many jobs. I had also used ALS from MLIB and saw the similar things. I had reached no success by playing with spark configuration and i hope someone can help me :) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark and disk usage.
Hi Andrew, Yes, I'm referring to sparkContext.setCheckpointDir(). It has the same effect as in Spark Streaming. For example, in an algorithm like ALS, the RDDs go through many transformations and the lineage of the RDD starts to grow drastically just like the lineage of DStreams do in Spark Streaming. You may observe StackOverflowErrors in ALS if you set the number of iterations to be very high. If you set the checkpointing directory however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick off from there. You won't need to keep the shuffle data before the checkpointed state, therefore those can be safely removed (will be removed automatically). However, checkpoint must be called explicitly as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 ,just setting the directory will not be enough. Best, Burak - Original Message - From: Andrew Ash and...@andrewash.com To: Burak Yavuz bya...@stanford.edu Cc: Макар Красноперов connector@gmail.com, user user@spark.apache.org Sent: Wednesday, September 17, 2014 10:19:42 AM Subject: Re: Spark and disk usage. Hi Burak, Most discussions of checkpointing in the docs is related to Spark streaming. Are you talking about the sparkContext.setCheckpointDir()? What effect does that have? https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing On Wed, Sep 17, 2014 at 7:44 AM, Burak Yavuz bya...@stanford.edu wrote: Hi, The files you mentioned are temporary files written by Spark during shuffling. ALS will write a LOT of those files as it is a shuffle heavy algorithm. Those files will be deleted after your program completes as Spark looks for those files in case a fault occurs. Having those files ready allows Spark to continue from the stage the shuffle left off, instead of starting from the very beginning. Long story short, it's to your benefit that Spark writes those files to disk. If you don't want Spark writing to disk, you can specify a checkpoint directory in HDFS, where Spark will write the current status instead and will clean up files from disk. Best, Burak - Original Message - From: Макар Красноперов connector@gmail.com To: user@spark.apache.org Sent: Wednesday, September 17, 2014 7:37:49 AM Subject: Spark and disk usage. Hello everyone. The problem is that spark write data to the disk very hard, even if application has a lot of free memory (about 3.8g). So, I've noticed that folder with name like spark-local-20140917165839-f58c contains a lot of other folders with files like shuffle_446_0_1. The total size of files in the dir spark-local-20140917165839-f58c can reach 1.1g. Sometimes its size decreases (are there only temp files in that folder?), so the totally amount of data written to the disk is greater than 1.1g. The question is what kind of data Spark store there and can I make spark not to write it on the disk and just keep it in the memory if there is enough RAM free space? I run my job locally with Spark 1.0.1: ./bin/spark-submit --driver-memory 12g --master local[3] --properties-file conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar spark-defaults.conf : spark.shuffle.spill false spark.reducer.maxMbInFlight 1024 spark.shuffle.file.buffer.kb2048 spark.storage.memoryFraction0.7 The situation with disk usage is common for many jobs. I had also used ALS from MLIB and saw the similar things. 
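For readers looking for the exact calls being described above, here is a minimal Scala sketch (the HDFS path and the parsing step are placeholders, not taken from the thread; `sc` is the SparkContext):

```
// Tell Spark where to materialize checkpoints (any HDFS path you own)
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val ratings = sc.textFile("hdfs:///data/ratings").map(_.split("::"))
ratings.checkpoint()   // only marks the RDD; nothing is written yet
ratings.count()        // an action materializes the RDD and writes the checkpoint
```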
I had reached no success by playing with spark configuration and i hope someone can help me :) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark and disk usage.
Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in and having to start from scratch. The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes come as a bonus to checkpointing, it is not the main goal. The subtlety here is that .checkpoint() is just like .cache(). Until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a row and you don't want to checkpoint in the meantime until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long. I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS, and resets all its dependencies, so you start a fresh lineage. My understanding would be that checkpointing still should be done every N operations to reset the lineage. However, an action must be performed before the lineage grows too long. I believe it would be nice to write up checkpointing in the programming guide. The reason that it's not there yet I believe is that most applications don't grow such a long lineage, except in Spark Streaming, and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have! Burak - Original Message - From: Andrew Ash and...@andrewash.com To: Burak Yavuz bya...@stanford.edu Cc: Макар Красноперов connector@gmail.com, user user@spark.apache.org Sent: Wednesday, September 17, 2014 11:04:02 AM Subject: Re: Spark and disk usage. Thanks for the info! Are there performance impacts with writing to HDFS instead of local disk? I'm assuming that's why ALS checkpoints every third iteration instead of every iteration. Also I can imagine that checkpointing should be done every N shuffles instead of every N operations (counting maps), since only the shuffle leaves data on disk. Do you have any suggestions on this? We should write up some guidance on the use of checkpointing in the programming guide https://spark.apache.org/docs/latest/programming-guide.html - I can help with this Andrew - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
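To make the "checkpoint every N operations, then run an action" point concrete, a rough sketch in Scala (the update function and the value of N are invented purely for illustration):

```
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val step = (x: Double) => 0.9 * x + 1.0        // stand-in for one iteration's transformation
var state = sc.parallelize(1 to 1000000).map(_.toDouble)

for (i <- 1 to 1000) {
  state = state.map(step)
  if (i % 50 == 0) {                           // every N = 50 iterations
    state.checkpoint()                         // marks the RDD so the lineage gets truncated...
    state.count()                              // ...but only once an action materializes it
  }
}
```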
Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator
Hi, Could you try repartitioning the data by .repartition(# of cores on machine) or while reading the data, supply the number of minimum partitions as in sc.textFile(path, # of cores on machine). It may be that the whole data is stored in one block? If it is billions of rows, then the indexing probably will not work giving the exceeds Integer.MAX_VALUE error. Best, Burak - Original Message - From: francisco ftanudj...@nextag.com To: u...@spark.incubator.apache.org Sent: Wednesday, September 17, 2014 3:18:29 PM Subject: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator Hi, We are running aggregation on a huge data set (few billion rows). While running the task got the following error (see below). Any ideas? Running spark 1.1.0 on cdh distribution. ... 14/09/17 13:33:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 2083 bytes result sent to driver 14/09/17 13:33:30 INFO CoarseGrainedExecutorBackend: Got assigned task 1 14/09/17 13:33:30 INFO Executor: Running task 0.0 in stage 2.0 (TID 1) 14/09/17 13:33:30 INFO TorrentBroadcast: Started reading broadcast variable 2 14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(1428) called with curMem=163719, maxMem=34451478282 14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1428.0 B, free 32.1 GB) 14/09/17 13:33:30 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0 14/09/17 13:33:30 INFO TorrentBroadcast: Reading broadcast variable 2 took 0.027374294 s 14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(2336) called with curMem=165147, maxMem=34451478282 14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 32.1 GB) 14/09/17 13:33:30 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache 14/09/17 13:33:30 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 14/09/17 13:33:30 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkdri...@sas-model1.pv.sv.nextag.com:56631/user/MapOutputTracker#794212052] 14/09/17 13:33:30 INFO MapOutputTrackerWorker: Got the output locations 14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks 14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms 14/09/17 13:33:30 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Error occurred while fetching local blocks java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:104) at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:120) at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:358) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:208) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:205) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.getLocalBlocks(BlockFetcherIterator.scala:205) at 
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.initialize(BlockFetcherIterator.scala:240) at org.apache.spark.storage.BlockManager.getMultiple(BlockManager.scala:583) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:77) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:41) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/09/17 13:33:30 INFO
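In code, the two suggestions amount to something like this (the path and the partition counts are placeholders):

```
// Ask for more partitions up front so no single partition's data exceeds 2 GB
val rows = sc.textFile("hdfs:///data/huge_table", 512)   // second argument = minimum partitions

// Or split up an already-loaded RDD before the shuffle-heavy aggregation
val repartitioned = rows.repartition(512)
```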
Re: MLLib: LIBSVM issue
Hi, The spacing between the inputs should be a single space, not a tab. I feel like your inputs have tabs between them instead of a single space. Therefore the parser cannot parse the input. Best, Burak - Original Message - From: Sameer Tilak ssti...@live.com To: user@spark.apache.org Sent: Wednesday, September 17, 2014 7:25:10 PM Subject: MLLib: LIBSVM issue Hi All,We have a fairly large amount of sparse data. I was following the following instructions in the manual: Sparse dataIt is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:label index1:value1 index2:value2 ... import org.apache.spark.mllib.regression.LabeledPointimport org.apache.spark.mllib.util.MLUtilsimport org.apache.spark.rdd.RDD val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) I believe that I have formatted my data as per the required Libsvm format. Here is a snippet of that: 1122:11693:11771:11974:12334:1 2378:12562:1 1118:11389:11413:11454:1 1780:12562:15051:15417:15548:1 5798:15862:1 0150:1214:1468:11013:1 1078:11092:11117:11489:11546:11630:1 1635:11827:12024:12215:12478:1 2761:15985:16115:16218:1 0251:15578:1 However,When I use MLUtils.loadLibSVMFile(sc, path-to-data-file)I get the following error messages in mt spark-shell. Can someone please point me in right direction. java.lang.NumberFormatException: For input string: 150:1214:1 468:11013:11078:11092:11117:11489:1 1546:11630:11635:11827:12024:12215:1 2478:12761:15985:16115:16218:1 at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241) at java.lang.Double.parseDouble(Double.java:540) at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
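One way to normalize the whitespace before handing the data to MLUtils, assuming the separators really are tabs (paths are placeholders):

```
import org.apache.spark.mllib.util.MLUtils

// Collapse tabs and runs of whitespace into single spaces so the LIBSVM parser can split on " "
sc.textFile("hdfs:///data/raw_libsvm.txt")
  .map(_.trim.split("\\s+").mkString(" "))
  .saveAsTextFile("hdfs:///data/libsvm_clean")

val examples = MLUtils.loadLibSVMFile(sc, "hdfs:///data/libsvm_clean")
```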
Re: [mllib] State of Multi-Model training
Hi Kyle, I'm actively working on it now. It's pretty close to completion, I'm just trying to figure out bottlenecks and optimize as much as possible. As Phase 1, I implemented multi model training on Gradient Descent. Instead of performing Vector-Vector operations on rows (examples) and weights, I've batched them into matrices so that we can use Level 3 BLAS to speed things up. I've also added support for Sparse Matrices (https://github.com/apache/spark/pull/2294) as making use of sparsity will allow you to train more models at once. Best, Burak - Original Message - From: Kyle Ellrott kellr...@soe.ucsc.edu To: dev@spark.apache.org Sent: Tuesday, September 16, 2014 3:21:53 PM Subject: [mllib] State of Multi-Model training I'm curious about the state of development Multi-Model learning in MLlib (training sets of models during the same training session, rather then one at a time). The JIRA lists it as in progress targeting Spark 1.2.0 ( https://issues.apache.org/jira/browse/SPARK-1486 ). But there hasn't been any notes on it in over a month. I submitted a pull request for a possible method to do this work a little over two months ago (https://github.com/apache/spark/pull/1292), but haven't yet received any feedback on the patch yet. Is anybody else working on multi-model training? Kyle - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spark SQL
Hi, I'm not a master on SparkSQL, but from what I understand, the problem is that you're trying to access an RDD inside an RDD here: val xyz = file.map(line => *** extractCurRate(sqlContext.sql(select rate ... *** and here: xyz = file.map(line => *** extractCurRate(sqlContext.sql(select rate ... ***. RDDs can't be serialized inside other RDD tasks, therefore you're receiving the NullPointerException. More specifically, you are trying to generate a SchemaRDD inside an RDD, which you can't do. If `file` isn't huge, you can call .collect() to transform the RDD to an array and then use .map() on the Array. If the file is huge, then you may do number 3 first, join the two RDDs using 'txCurCode' as a key, and then do filtering operations, etc... Best, Burak - Original Message - From: rkishore999 rkishore...@yahoo.com To: u...@spark.incubator.apache.org Sent: Saturday, September 13, 2014 10:29:26 PM Subject: Spark SQL val file = sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt") 1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "' and fxCurCode = '" + fxCurCodesMap(line.substring(77,82)) + "' and effectiveDate = '" + line.substring(221,229) + "' order by effectiveDate desc")) 2. val xyz = file.map(line => sqlContext.sql("select rate, txCurCode, fxCurCode, effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and fxCurCode = 'CSD' and effectiveDate = '20140901' order by effectiveDate desc")) 3. val xyz = sqlContext.sql("select rate, txCurCode, fxCurCode, effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and fxCurCode = 'CSD' and effectiveDate = '20140901' order by effectiveDate desc") xyz.saveAsTextFile("/user/output") In statements 1 and 2 I'm getting a NullPointerException. But statement 3 is good. I'm guessing spark context and sql context are not going together well. Any suggestions regarding how I can achieve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp14183.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
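A sketch of the "collect the small side" option mentioned above, reusing the column offsets from the snippet in the thread; everything else (row accessors, the exact date comparison) is guessed for illustration and is not the poster's actual code:

```
val rates = sqlContext.sql(
    "select txCurCode, fxCurCode, effectiveDate, rate from CurrencyCodeRates")
  .map(r => (r.getString(0), r.getString(1), r.getString(2), r.getDouble(3)))
  .collect()                                   // only reasonable if the rates table is small
val ratesBc = sc.broadcast(rates)

val xyz = file.map { line =>
  val tx  = line.substring(202, 205)
  val fx  = fxCurCodesMap(line.substring(77, 82))
  val eff = line.substring(221, 229)
  // plain Scala lookup against the broadcast array instead of running a SQL query per line
  ratesBc.value
    .filter(r => r._1 == tx && r._2 == fx && r._3 <= eff)
    .sortBy(_._3)
    .lastOption
    .map(_._4)
}
```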
Re: Filter function problem
Hi, val test = persons.value .map{tuple = (tuple._1, tuple._2 .filter{event = *inactiveIDs.filter(event2 = event2._1 == tuple._1).count() != 0})} Your problem is right between the asterisk. You can't make an RDD operation inside an RDD operation, because RDD's can't be serialized. Therefore you are receiving the NullPointerException. Try joining the RDDs based on `event` and then filter based on that. Best, Burak - Original Message - From: Blackeye black...@iit.demokritos.gr To: u...@spark.incubator.apache.org Sent: Tuesday, September 9, 2014 3:34:58 AM Subject: Re: Filter function problem In order to help anyone to answer i could say that i checked the inactiveIDs.filter operation seperated, and I found that it doesn't return null in any case. In addition i don't how to handle (or check) whether a RDD is null. I find the debugging to complicated to point the error. Any ideas how to find the null pointer? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Filter-function-problem-tp13787p13789.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
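A sketch of the join-based rewrite, using the names from the thread but guessing at the types (assume persons.value is an RDD[(K, Seq[Event])] and inactiveIDs is an RDD keyed the same way):

```
val inactiveKeys = inactiveIDs.map { case (k, _) => (k, ()) }.distinct()

val test = persons.value
  .leftOuterJoin(inactiveKeys)
  .map { case (k, (events, matched)) =>
    // mirrors the original intent: keep the events only for ids that appear in inactiveIDs
    (k, if (matched.isDefined) events else Seq.empty)
  }
```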
[Phpmyadmin-git] [phpmyadmin/localized_docs] 6d551e: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 6d551e2fce7ea6e02e1194acc6a800a1af836b5b https://github.com/phpmyadmin/localized_docs/commit/6d551e2fce7ea6e02e1194acc6a800a1af836b5b Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-09-07 (Sun, 07 Sep 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 99.8% (1651 of 1653) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
[jira] [Created] (SPARK-3418) Additional BLAS and Local Sparse Matrix support
Burak Yavuz created SPARK-3418: -- Summary: Additional BLAS and Local Sparse Matrix support Key: SPARK-3418 URL: https://issues.apache.org/jira/browse/SPARK-3418 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For Multi-Model training, adding support for Level-3 BLAS functions is vital. In addition, as most real data is sparse, support for Local Sparse Matrices will also be added, as supporting sparse matrices will save a lot of memory and will lead to better performance. The ability to left multiply a dense matrix with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` is a sparse matrix will also be added. However, `B` and `C` will remain as Dense Matrices for now. I will post performance comparisons with other libraries that support sparse matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3418) [MLlib] Additional BLAS and Local Sparse Matrix support
[ https://issues.apache.org/jira/browse/SPARK-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-3418: --- Summary: [MLlib] Additional BLAS and Local Sparse Matrix support (was: Additional BLAS and Local Sparse Matrix support) [MLlib] Additional BLAS and Local Sparse Matrix support --- Key: SPARK-3418 URL: https://issues.apache.org/jira/browse/SPARK-3418 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For Multi-Model training, adding support for Level-3 BLAS functions is vital. In addition, as most real data is sparse, support for Local Sparse Matrices will also be added, as supporting sparse matrices will save a lot of memory and will lead to better performance. The ability to left multiply a dense matrix with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` is a sparse matrix will also be added. However, `B` and `C` will remain as Dense Matrices for now. I will post performance comparisons with other libraries that support sparse matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
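For readers unfamiliar with the BLAS-3 notation, the update `C := alpha * A * B + beta * C` written out naively over dense column-major arrays looks roughly like the following; this is purely an illustration of the operation, not the MLlib implementation:

```
// Naive dense gemm over column-major arrays: a is m x k, b is k x n, c is m x n
def naiveGemm(alpha: Double, a: Array[Double], b: Array[Double],
              beta: Double, c: Array[Double], m: Int, k: Int, n: Int): Unit = {
  var j = 0
  while (j < n) {
    var i = 0
    while (i < m) {
      var sum = 0.0
      var l = 0
      while (l < k) {
        sum += a(l * m + i) * b(j * k + l)   // A(i, l) * B(l, j)
        l += 1
      }
      c(j * m + i) = alpha * sum + beta * c(j * m + i)
      i += 1
    }
    j += 1
  }
}
```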
[Phpmyadmin-git] [phpmyadmin/localized_docs] 1e0179: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 1e0179a5b88ed87450de23a73fac265a686d0476 https://github.com/phpmyadmin/localized_docs/commit/1e0179a5b88ed87450de23a73fac265a686d0476 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-30 (Sat, 30 Aug 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 99.8% (1650 of 1653) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
[jira] [Updated] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-3280: --- Attachment: hash-sort-comp.png Made sort-based shuffle the default implementation -- Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin Attachments: hash-sort-comp.png sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114873#comment-14114873 ] Burak Yavuz commented on SPARK-3280: I don't have as detailed a comparison like Josh has, but for MLlib algorithms, sort based shuffle didn't show the performance boosts Josh has shown. 16 m3.2xlarge instances were used for these experiments. The difference here is that the number of partitions I used were 128. Much less than the number of partitions Josh has shown. !hash-sort-comp.png! Made sort-based shuffle the default implementation -- Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin Attachments: hash-sort-comp.png sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1. Tested MLlib algorithms on Amazon EC2, algorithms show speed-ups between 1.5-5x compared to the 1.0.2 release. - Original Message - From: Patrick Wendell pwend...@gmail.com To: dev@spark.apache.org Sent: Thursday, August 28, 2014 8:32:11 PM Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC2) I'll kick off the vote with a +1. On Thu, Aug 28, 2014 at 7:14 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc2 (commit 711aebb3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=711aebb329ca28046396af1e34395a0df92b5327 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1029/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc2-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Monday, September 01, at 03:11 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC1 == LZ4 compression issue: https://issues.apache.org/jira/browse/SPARK-3277 == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: OutofMemoryError when generating output
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 28, 2014 12:45:22 PM Subject: Re: OutofMemoryError when generating output Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). valx = sc.textFile(baseFile)).map { line = val fields = line.split(\t) (fields(11), fields(6)) // extract (month, user_id) }.distinct().countByKey() x.saveAsTextFile(...) // does not work. generates an error that saveAstextFile is not defined for Map object Is there a way to convert the Map object to an object that I can output to console and to a file? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847p13056.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
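Putting the two replies in this thread together, a sketch of the whole pipeline (the column indices and output path come from the thread; the rest is illustrative only):

```
val counts = sc.textFile(baseFile)
  .map { line =>
    val fields = line.split("\t")
    (fields(11), fields(6))                 // (month, user_id)
  }
  .distinct()
  .countByKey()                             // a driver-side Map[String, Long], not an RDD

// Turn the driver-side Map back into an RDD so saveAsTextFile is available
sc.parallelize(counts.toSeq)
  .map { case (month, users) => s"$month, $users" }
  .saveAsTextFile("app_output")
```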
Re: Memory statistics in the Application detail UI
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 28, 2014 6:32:32 PM Subject: Memory statistics in the Application detail UI Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
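For what it's worth, the arithmetic behind those numbers (assuming the 1.x defaults of spark.storage.memoryFraction = 0.6 and a safety fraction of 0.9, which is my recollection rather than something stated in the thread) works out to roughly 16 GB x 0.6 x 0.9 ≈ 8.6 GB of storage memory per executor; the 95.5 GB total is then that per-executor figure summed over all executors, plus the driver's share.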
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] d0e0ed: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: d0e0ed047816fa84ce213df88be75670c765eeb5 https://github.com/phpmyadmin/phpmyadmin/commit/d0e0ed047816fa84ce213df88be75670c765eeb5 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-27 (Wed, 27 Aug 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2964 of 2964) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
Re: Amplab: big-data-benchmark
Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile(s3n://big-data-benchmark/pavlo/text/tiny/crawl)` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings in lower case after the format and size you want them in. They should be there. Best, Burak - Original Message - From: Sameer Tilak ssti...@live.com To: user@spark.apache.org Sent: Wednesday, August 27, 2014 11:42:28 AM Subject: Amplab: big-data-benchmark Hi All, I am planning to run amplab benchmark suite to evaluate the performance of our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions about data avallability at: s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]where /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able to doanload these datasets directly. Here is what I see. I read that they can be used directly by doing : sc.textFile(s3:/). However, I wanted to make sure that my understanding is correct. Here is what I see at http://s3.amazonaws.com/big-data-benchmark/ I do not see anything for sequence or text-deflate. I see sequence-snappy dataset: ContentsKeypavlo/sequence-snappy/5nodes/crawl/000738_0/KeyLastModified2013-05-27T21:26:40.000Z/LastModifiedETaga978d18721d5a533d38a88f558461644/ETagSize42958735/SizeStorageClassSTANDARD/StorageClass/Contents For text, I get the following error: ErrorCodeNoSuchKey/CodeMessageThe specified key does not exist./MessageKeypavlo/text/1node/crawl/KeyRequestId166D239D38399526/RequestIdHostId4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI/HostId/Error Please let me know if there is a way to readily download the dataset and view it. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: OutofMemoryError when generating output
Hi, The error doesn't occur during saveAsTextFile but rather during the groupByKey as far as I can tell. We strongly urge users to not use groupByKey if they don't have to. What I would suggest is the following work-around: sc.textFile(baseFile)).map { line = val fields = line.split(\t) (fields(11), fields(6)) // extract (month, user_id) }.distinct().countByKey() instead Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Tuesday, August 26, 2014 12:38:00 PM Subject: OutofMemoryError when generating output Hi, I have the following piece of code that I am running on a cluster with 10 nodes with 2GB memory per node. The tasks seem to complete, but at the point where it is generating output (saveAsTextFile), the program freezes after some time and reports an out of memory error (error transcript attached below). I also tried using collect() and printing the output to console instead of a file, but got the same error. The program reads some logs for a month and extracts the number of unique users during the month. The reduced output is not very large, so not sure why the memory error occurs. I would appreciate any help in fixing this memory error to get the output. Thanks. def main (args: Array[String]) { val conf = new SparkConf().setAppName(App) val sc = new SparkContext(conf) // get the number of users per month val user_time = sc.union(sc.textFile(baseFile)) .map(line = { val fields = line.split(\t) (fields(11), fields(6)) }) // extract (month, user_id) .groupByKey // group by month as the key .map(g= (g._1, g._2.toSet.size)) // get the unique id count per month // .collect() // user_time.foreach(f = println(f)) user_time.map(f = %s, %s.format(f._1, f._2)).saveAsTextFile(app_output) sc.stop() } 14/08/26 15:21:15 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:121) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4.apply(PairRDDFunctions.scala:107) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4.apply(PairRDDFunctions.scala:106) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847.html Sent from the Apache Spark User List 
mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: saveAsTextFile hangs with hdfs
Hi David, Your job is probably hanging on the groupByKey process. Probably GC is kicking in and the process starts to hang or the data is unbalanced and you end up with stragglers (Once GC kicks in you'll start to get the connection errors you shared). If you don't care about the list of values itself, but the count of it (that appears to be what you're trying to save, correct me if I'm wrong), then I would suggest using `countByKey()` directly on `JavaPairRDD<String, AnalyticsLogFlyweight> partitioned`. Best, Burak - Original Message - From: David david.b...@gmail.com To: user u...@spark.incubator.apache.org Sent: Tuesday, August 19, 2014 1:44:18 PM Subject: saveAsTextFile hangs with hdfs I have a simple spark job that seems to hang when saving to hdfs. When looking at the spark web ui, the job reached 97 of 100 tasks completed. I need some help determining why the job appears to hang. The job hangs on the saveAsTextFile() call. https://www.dropbox.com/s/fdp7ck91hhm9w68/Screenshot%202014-08-19%2010.53.24.png The job is pretty simple: JavaRDD<String> analyticsLogs = context .textFile(Joiner.on(",").join(hdfs.glob("/spark-dfs", ".*\\.log$")), 4); JavaRDD<AnalyticsLogFlyweight> flyweights = analyticsLogs .map(line -> { try { AnalyticsLog log = GSON.fromJson(line, AnalyticsLog.class); AnalyticsLogFlyweight flyweight = new AnalyticsLogFlyweight(); flyweight.ipAddress = log.getIpAddress(); flyweight.time = log.getTime(); flyweight.trackingId = log.getTrackingId(); return flyweight; } catch (Exception e) { LOG.error("error parsing json", e); return null; } }); JavaRDD<AnalyticsLogFlyweight> filtered = flyweights .filter(log -> log != null); JavaPairRDD<String, AnalyticsLogFlyweight> partitioned = filtered .mapToPair((AnalyticsLogFlyweight log) -> new Tuple2(log.trackingId, log)) .partitionBy(new HashPartitioner(100)).cache(); Ordering<AnalyticsLogFlyweight> ordering = Ordering.natural().nullsFirst().onResultOf(new Function<AnalyticsLogFlyweight, Long>() { public Long apply(AnalyticsLogFlyweight log) { return log.time; } }); JavaPairRDD<String, Iterable<AnalyticsLogFlyweight>> stringIterableJavaPairRDD = partitioned.groupByKey(); JavaPairRDD<String, Integer> stringIntegerJavaPairRDD = stringIterableJavaPairRDD.mapToPair((log) -> { List<AnalyticsLogFlyweight> sorted = Lists.newArrayList(log._2()); sorted.forEach(l -> LOG.info("sorted {}", l)); return new Tuple2(log._1(), sorted.size()); }); String outputPath = "/summarized/groupedByTrackingId4"; hdfs.rm(outputPath, true); stringIntegerJavaPairRDD.saveAsTextFile(String.format("%s/%s", hdfs.getUrl(), outputPath)); Thanks in advance, David
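A small caveat to the suggestion above: countByKey() returns its result to the driver, so it is only appropriate when the number of distinct keys is modest; with many trackingIds a distributed count keeps everything on the cluster. In Scala for brevity, assuming partitioned is the equivalent RDD[(String, AnalyticsLogFlyweight)] and using the output path from the thread:

```
// Few distinct keys: fine to bring the counts back to the driver
val counts: scala.collection.Map[String, Long] = partitioned.countByKey()

// Many distinct keys: count with a shuffle and write the result straight out
partitioned
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs:///summarized/groupedByTrackingId4")
```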
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 9cedf0: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 9cedf0a58feaad7604cdf7d09828854c11c630e6 https://github.com/phpmyadmin/phpmyadmin/commit/9cedf0a58feaad7604cdf7d09828854c11c630e6 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-25 (Mon, 25 Aug 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2962 of 2962) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
Re: Finding Rank in Spark
Spearman's Correlation requires the calculation of ranks for columns. You can check out the code here and slice the part you need! https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala Best, Burak - Original Message - From: athiradas athira@flutura.com To: u...@spark.incubator.apache.org Sent: Friday, August 22, 2014 4:14:34 AM Subject: Re: Finding Rank in Spark Does anyone know a way to do this? I tried it by sorting it and writing an auto-increment function. But since it's parallel computing, the result is wrong. Is there any way? Please reply -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-Rank-in-Spark-tp12028p12647.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
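If a simple global rank is enough (ties broken arbitrarily), the usual pattern is a sort followed by zipWithIndex; the SpearmanCorrelation code linked above additionally averages the ranks of tied values. A minimal sketch with made-up data:

```
val values = sc.parallelize(Seq(3.2, 1.5, 4.8, 1.5, 2.0))

val ranked = values
  .map(v => (v, ()))
  .sortByKey()          // ascending sort by value
  .keys
  .zipWithIndex()       // (value, rank starting at 0), assigned in sort order
```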
Re: LDA example?
You can check out this pull request: https://github.com/apache/spark/pull/476 LDA is on the roadmap for the 1.2 release, hopefully we will officially support it then! Best, Burak - Original Message - From: Denny Lee denny.g@gmail.com To: user@spark.apache.org Sent: Thursday, August 21, 2014 10:10:35 PM Subject: LDA example? Quick question - is there a handy sample / example of how to use the LDA algorithm within Spark MLLib? Thanks! Denny - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] b574ff: Translated using Weblate (Romanian)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: b574fff77c983dd103bc53ae1751030cbb7c17c1 https://github.com/phpmyadmin/phpmyadmin/commit/b574fff77c983dd103bc53ae1751030cbb7c17c1 Author: Alecsandru Prună alecsandrucpr...@gmail.com Date: 2014-08-16 (Sat, 16 Aug 2014) Changed paths: M po/ro.po Log Message: --- Translated using Weblate (Romanian) Currently translated at 57.7% (1713 of 2964) [ci skip] Commit: 75a7f0f9791bd737b4c6619879b82956ae1e3bfe https://github.com/phpmyadmin/phpmyadmin/commit/75a7f0f9791bd737b4c6619879b82956ae1e3bfe Author: Weblate nore...@weblate.org Date: 2014-08-16 (Sat, 16 Aug 2014) Changed paths: M test/libraries/PMA_SetupIndex_test.php Log Message: --- Merge remote-tracking branch 'origin/master' Commit: 21a01002926cd479b2e2592b4fbea827509fed14 https://github.com/phpmyadmin/phpmyadmin/commit/21a01002926cd479b2e2592b4fbea827509fed14 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-16 (Sat, 16 Aug 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2964 of 2964) [ci skip] Compare: https://github.com/phpmyadmin/phpmyadmin/compare/5c7c5ba5337d...21a01002926c-- ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
[jira] [Created] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
Burak Yavuz created SPARK-3080: -- Summary: ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Reporter: Burak Yavuz The stack trace is below: ``` java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) ``` This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-3080: --- Description: The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances was: The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-3080: --- Description: The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings was: The stack trace is below: ``` java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31
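For context, a minimal sketch of the kind of ALS job that exercises this code path. Every name and parameter below is a placeholder — the report only describes the dataset size (~12B ratings) and the cluster, not the actual job:

```
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical reproduction sketch: train ALS on a very large ratings RDD.
// Path, rank, iteration count, lambda and block count are all illustrative.
val ratings = sc.textFile("hdfs:///ratings").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val model = ALS.train(ratings, 50, 10, 0.01, 500) // rank, iterations, lambda, blocks
```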
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] 2b61fb: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: 2b61fb1281e094a885f580f83d8381c7cca8bb04 https://github.com/phpmyadmin/phpmyadmin/commit/2b61fb1281e094a885f580f83d8381c7cca8bb04 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-13 (Wed, 13 Aug 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2961 of 2961) [ci skip] -- ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
[jira] [Resolved] (SPARK-2833) performance tests for linear regression
[ https://issues.apache.org/jira/browse/SPARK-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-2833. Resolution: Fixed performance tests for linear regression --- Key: SPARK-2833 URL: https://issues.apache.org/jira/browse/SPARK-2833 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz linear regression, lasso, and ridge -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2837) performance tests for ALS
[ https://issues.apache.org/jira/browse/SPARK-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-2837. Resolution: Done performance tests for ALS - Key: SPARK-2837 URL: https://issues.apache.org/jira/browse/SPARK-2837 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz ALS (explicit/implicit) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2836) performance tests for k-means
[ https://issues.apache.org/jira/browse/SPARK-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz closed SPARK-2836. -- Resolution: Fixed performance tests for k-means - Key: SPARK-2836 URL: https://issues.apache.org/jira/browse/SPARK-2836 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2834) performance tests for linear algebra functions
[ https://issues.apache.org/jira/browse/SPARK-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-2834. Resolution: Fixed performance tests for linear algebra functions -- Key: SPARK-2834 URL: https://issues.apache.org/jira/browse/SPARK-2834 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz SVD and PCA -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2829) Implement MLlib performance tests in spark-perf
[ https://issues.apache.org/jira/browse/SPARK-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-2829. Resolution: Fixed Implement MLlib performance tests in spark-perf --- Key: SPARK-2829 URL: https://issues.apache.org/jira/browse/SPARK-2829 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz We don't have performance tests for MLlib in spark-perf: https://github.com/databricks/spark-perf So it is hard to catch regression problems automatically. This is an umbrella JIRA for implementing performance tests for MLlib's algorithms. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2831) performance tests for linear classification methods
[ https://issues.apache.org/jira/browse/SPARK-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-2831. Resolution: Fixed performance tests for linear classification methods --- Key: SPARK-2831 URL: https://issues.apache.org/jira/browse/SPARK-2831 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz 1. logistic regression 2. linear svm 3. with LBFGS 4. naive bayes -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [MLLib]:choosing the Loss function
Hi,

// Initialize the optimizer using logistic regression as the loss function with L2 regularization
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
// Set the hyperparameters
lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)
// Retrieve the weights
val weightsWithIntercept = lbfgs.optimize(data, initialWeights)
// Slice weights with intercept into weights and intercept
// Initialize the Logistic Regression Model
val model = new LogisticRegressionModel(weights, intercept)
// Make your predictions
model.predict(test)

The example code doesn't generate the Logistic Regression Model that you can make predictions with. `LBFGS.runMiniBatchLBFGS` outputs a tuple of (weights, lossHistory). The example code was for a benchmark, so it was more interested in the loss history than the model itself. You can also run `val (weightsWithIntercept, localLoss) = LBFGS.runMiniBatchLBFGS ...`, slice `weightsWithIntercept` into the intercept and the rest of the weights, and instantiate the model again as: val model = new LogisticRegressionModel(weights, intercept) Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Monday, August 11, 2014 11:52:04 AM Subject: Re: [MLLib]:choosing the Loss function Hi, Thanks for the reference to the LBFGS optimizer. I tried to use the LBFGS optimizer, but I am not able to pass it as an input to the LogisticRegression model for binary classification. After studying the code in mllib/classification/LogisticRegression.scala, it appears that the only implementation of LogisticRegression uses GradientDescent as a fixed optimizer. In other words, I don't see a setOptimizer() function that I can use to change the optimizer to LBFGS. I tried to follow the code in https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala that makes use of LBFGS, but it is not clear to me where the LogisticRegression model with LBFGS is being returned that I can use for the classification of the test dataset. If someone has sample code that uses LogisticRegression with LBFGS instead of GradientDescent as the optimization algorithm, it would be helpful if you can post it. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738p11913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
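For completeness, here is one way to put those pieces together end to end. This is only a sketch assembled from the snippets in this thread: the hyperparameter values are placeholders, `data` (an RDD[(Double, Vector)] of labels and features) and `initialWeights` are assumed to be defined by the caller, and the slicing assumes the intercept was appended as the last element of the weight vector.

```
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

// Placeholder hyperparameters.
val numIterations = 100
val regParam = 0.1
val tol = 1e-4
val numCor = 10

// Optimize the logistic loss with L2 regularization using L-BFGS.
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
lbfgs.setMaxNumIterations(numIterations)
  .setRegParam(regParam)
  .setConvergenceTol(tol)
  .setNumCorrections(numCor)

val weightsWithIntercept = lbfgs.optimize(data, initialWeights)

// Assumed layout: the intercept is the last entry of the returned vector.
val weights = Vectors.dense(weightsWithIntercept.toArray.dropRight(1))
val intercept = weightsWithIntercept.toArray.last

// Rebuild a model you can predict with.
val model = new LogisticRegressionModel(weights, intercept)
```

Whether the intercept sits at the head or the tail of the vector depends on how the bias column was appended to the training data, so adjust the slice accordingly.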
[jira] [Commented] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations
[ https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090498#comment-14090498 ] Burak Yavuz commented on SPARK-2916: will do [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations -- Key: SPARK-2916 URL: https://issues.apache.org/jira/browse/SPARK-2916 Project: Spark Issue Type: Bug Components: MLlib Reporter: Burak Yavuz Priority: Blocker While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations
[ https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-2916: --- Description: While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 In order to replicate the problem, use aggregate multiple times, maybe over 50-60 times. was: While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations -- Key: SPARK-2916 URL: https://issues.apache.org/jira/browse/SPARK-2916 Project: Spark Issue Type: Bug Components: MLlib Reporter: Burak Yavuz Priority: Blocker While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 In order to replicate the problem, use aggregate multiple times, maybe over 50-60 times. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
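A rough sketch of the repeated aggregation described in the replication note above. This uses plain RDD.aggregate over large dense vectors purely as an illustration; the sizes and iteration count are placeholders, and the MLlib code in question uses the tree-shaped treeAggregate variant rather than plain aggregate:

```
// Hypothetical reproduction sketch: aggregate a large dense-vector dataset many
// times in a row, the way gradient descent does once per iteration.
val dim = 5000
val data = sc.parallelize(0 until 1000000, 1000)
  .map(_ => Array.fill(dim)(scala.util.Random.nextDouble()))

for (_ <- 1 to 60) {
  val sum = data.aggregate(new Array[Double](dim))(
    // Add one row into the running per-partition sum.
    (acc, v) => { var j = 0; while (j < dim) { acc(j) += v(j); j += 1 }; acc },
    // Merge two per-partition sums.
    (a, b) => { var j = 0; while (j < dim) { a(j) += b(j); j += 1 }; a })
}
```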
[jira] [Updated] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations
[ https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-2916: --- Component/s: Spark Core [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations -- Key: SPARK-2916 URL: https://issues.apache.org/jira/browse/SPARK-2916 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Burak Yavuz Priority: Blocker While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 In order to replicate the problem, use aggregate multiple times, maybe over 50-60 times. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations
[ https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-2916: --- Summary: [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations (was: While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations -- Key: SPARK-2916 URL: https://issues.apache.org/jira/browse/SPARK-2916 Project: Spark Issue Type: Bug Components: MLlib Reporter: Burak Yavuz Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2916) While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations
Burak Yavuz created SPARK-2916: -- Summary: While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations Key: SPARK-2916 URL: https://issues.apache.org/jira/browse/SPARK-2916 Project: Spark Issue Type: Bug Components: MLlib Reporter: Burak Yavuz Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations
[ https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-2916: --- Description: While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations -- Key: SPARK-2916 URL: https://issues.apache.org/jira/browse/SPARK-2916 Project: Spark Issue Type: Bug Components: MLlib Reporter: Burak Yavuz Priority: Blocker While running any of the regression algorithms with gradient descent, the treeAggregate blows up after several iterations. Observed on EC2 cluster with 16 nodes, matrix dimensions of 1,000,000 x 5,000 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: KMeans Input Format
Hi, Could you try running spark-shell with the flag --driver-memory 2g (or more, if you have more RAM available) and try again? Thanks, Burak - Original Message - From: AlexanderRiggers alexander.rigg...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40 AM Subject: KMeans Input Format I want to perform a K-Means task, but training the model fails and I get kicked out of Spark's Scala shell before I get my result metrics. I am not sure if the input format is the problem or something else. I use Spark 1.0.0 and my input text file (400MB) looks like this:
86252 3711 15.4 4.18
86252 3504 28 1.25
86252 3703 10.75 8.85
86252 3703 10.5 5.55
86252 2201 64 2.79
12262064 7203 32 8.49
12262064 2119 32 1.99
12262064 3405 8.5 2.99
12262064 2119 23 0
12262064 2119 33.8 1.5
12262064 3611 23.7 1.95
etc.
It is ID, Category, ProductSize, PurchaseAmount. I am not sure if I can use the first two, because in the MLlib example file they only use floats. So I also tried the last two:
16 2.49
64 3.29
56 1
16 3.29
6 4.99
10.75 0.79
4.6 3.99
11 1.18
5.8 1.25
15 0.99
My error code in both cases is here:
scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala>
scala> // Load and parse the data
scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 16:15:37 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14
scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:16
scala>
scala> // Cluster the data into two classes using KMeans
scala> val numClusters = 2
numClusters: Int = 2
scala> val numIterations = 20
numIterations: Int = 20
scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded 14/08/07 16:15:38 INFO FileInputFormat: Total input paths to process : 1 14/08/07 16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260 14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false) 14/08/07 16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260) 14/08/07 16:15:38 INFO DAGScheduler: Parents of final stage: List() 14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List() 14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents 14/08/07 16:15:39 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123) 14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:2 as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:3 as 2221 bytes in 1 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4 as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL) 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms 14/08/07 16:15:39 INFO Executor: Running task ID 4 14/08/07 16:15:39 INFO Executor: Running task ID 1 14/08/07 16:15:39 INFO Executor: Running task ID 5 14/08/07 16:15:39 INFO Executor: Running task ID 6 14/08/07 16:15:39 INFO Executor: Running task ID 0 14/08/07 16:15:39 INFO Executor: Running task ID 3 14/08/07 16:15:39 INFO Executor: Running task ID 2 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0
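A minimal sketch of the suggestion above, assuming the shell is started with more driver memory and that only the last two numeric columns of each row are used as features (the file path and memory size are placeholders):

```
// Start the shell with more driver memory, e.g.:
//   bin/spark-shell --driver-memory 2g
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.textFile("data/outkmeanssm.txt")

// Keep only the last two numeric columns (ProductSize, PurchaseAmount) and
// cache, so the K-Means iterations do not re-read the 400MB file each pass.
val parsedData = raw.map { line =>
  val cols = line.split(' ')
  Vectors.dense(cols.takeRight(2).map(_.toDouble))
}.cache()

// 2 clusters, 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
```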
Re: questions about MLLib recommendation models
Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing the dataset well enough. This seems to be a bug. I will open a JIRA for it. Instead of using:

ratings.map { case Rating(u, m, r) => {
  val pred = model.predict(u, m)
  (r - pred) * (r - pred)
} }.mean()

you can use something like:

val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
val predictionsAndRatings: RDD[(Double, Double)] = predictions.map { x =>
  def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
  ((x.user, x.product), mapPredictedRating(x.rating))
}.join(data.map(x => ((x.user, x.product), x.rating))).values
math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())

This workaround worked for me (a self-contained version of it follows the quoted message below). Regarding your question 2, it would be best if you do a special filtering of the dataset so that you do train on that user and product. If we don't have any data trained on a user, there is no way to predict how he would like a product. That filtering takes a lot of work, though. I can share some code on that too if you like. Best, Burak - Original Message - From: Jay Hutfles jayhutf...@gmail.com To: user@spark.apache.org Sent: Thursday, August 7, 2014 1:06:33 PM Subject: questions about MLLib recommendation models I have a few questions regarding a collaborative filtering model, and was hoping for some recommendations (no pun intended...) *Setup* I have a csv file with user/movie/ratings named unimaginatively 'movies.csv'. Here are the contents:
0,0,5
0,1,5
0,2,0
0,3,0
1,0,5
1,3,0
2,1,4
2,2,0
3,0,0
3,1,0
3,2,5
3,3,4
4,0,0
4,1,0
4,2,5
I then load it into an RDD with a nice command like

val ratings = sc.textFile("movies.csv").map(_.split(',') match { case Array(u, m, r) => Rating(u.toInt, m.toInt, r.toDouble) })

So far so good. I'm even okay building a model for predicting the absent values in the matrix with

val rank = 10
val iters = 20
val model = ALS.train(ratings, rank, iters)

I can then use the model to predict any user/movie rating without trouble, like

model.predict(2, 0)

*Question 1: * If I were to calculate, say, the mean squared error of the training set (or to my next question, a test set), this doesn't work:

ratings.map { case Rating(u, m, r) => {
  val pred = model.predict(u, m)
  (r - pred) * (r - pred)
} }.mean()

Actually, any action on RDDs created by mapping over the RDD[Rating] with a model prediction fails, like

ratings.map { case Rating(u, m, _) => model.predict(u, m) }.collect

I get errors due to a scala.MatchError: null. 
Here's the exact verbiage: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26150.0:1 failed 1 times, most recent failure: Exception failure in TID 7091 on host localhost: scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:18) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:18) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) I think I'm missing something, since I can build up a scala collection of the exact (user, movie) tuples I'm testing, map over that with the model prediction, and it works fine. But if I map over the RDD[Rating], it doesn't. Am I doing something obviously wrong?
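Putting the workaround above together, a self-contained sketch for the explicit-feedback case (so no clamping of the predicted rating). It assumes `ratings` is the RDD[Rating] built from movies.csv and that it runs where `sc` is available, e.g. the shell:

```
import org.apache.spark.SparkContext._ // pair-RDD and double-RDD implicits on older Spark versions
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

val model = ALS.train(ratings, 10, 20)

// Predict all (user, product) pairs in one call instead of calling
// model.predict(u, m) inside a transformation over the RDD (which is what fails).
val predictions: RDD[Rating] = model.predict(ratings.map(r => (r.user, r.product)))

// Join predicted and actual ratings on the (user, product) key.
val predictionsAndRatings: RDD[(Double, Double)] = predictions
  .map(p => ((p.user, p.product), p.rating))
  .join(ratings.map(r => ((r.user, r.product), r.rating)))
  .values

val rmse = math.sqrt(predictionsAndRatings.map { case (p, r) => (p - r) * (p - r) }.mean())
```

The key point is that the model cannot be invoked from inside a closure running on the executors; the RDD-based predict avoids that.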
Re: [MLLib]:choosing the Loss function
The following code will allow you to run Logistic Regression using L-BFGS:

val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)
val weights = lbfgs.optimize(data, initialWeights)

The loss function you are asking about is selected by the `new LogisticGradient()` part, and the regularization by the `new SquaredL2Updater()` part. The supported loss functions are:
1) Logistic - LogisticGradient
2) LeastSquares - LeastSquaresGradient
3) Hinge - HingeGradient
The regularizers are:
0) No regularization - SimpleUpdater
1) L1 regularization - L1Updater
2) L2 regularization - SquaredL2Updater
You can find more here: http://spark.apache.org/docs/latest/mllib-linear-methods.html#loss-functions I would suggest using L-BFGS rather than SGD, as it's both much faster and more accurate. Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 6:31:14 PM Subject: [MLLib]:choosing the Loss function Hi, According to the MLLib guide, there seems to be support for different loss functions. But I could only find regType to choose the regularization; I could not find a command-line parameter to choose the loss function. Does MLLib support a parameter to choose the loss function? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
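To make the swap concrete, a sketch of the same optimizer wired for a different combination, here a linear-SVM-style objective (hinge loss with L2 regularization). The hyperparameters `numIterations`, `regParam`, `tol` and `numCor` are placeholders assumed to be defined by the caller, as are `data` and `initialWeights`:

```
import org.apache.spark.mllib.optimization.{HingeGradient, LBFGS, SquaredL2Updater}

// Same optimizer, different loss/regularizer: swap the Gradient and the Updater.
// LeastSquaresGradient, L1Updater, SimpleUpdater, etc. plug in the same way.
val hingeL2 = new LBFGS(new HingeGradient(), new SquaredL2Updater())
hingeL2.setMaxNumIterations(numIterations)
  .setRegParam(regParam)
  .setConvergenceTol(tol)
  .setNumCorrections(numCor)

val svmWeights = hingeL2.optimize(data, initialWeights)
```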
[Phpmyadmin-git] [phpmyadmin/localized_docs] 96fe35: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/localized_docs Commit: 96fe3575d28d71967ff6d906c4cc1c720014427e https://github.com/phpmyadmin/localized_docs/commit/96fe3575d28d71967ff6d906c4cc1c720014427e Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-06 (Wed, 06 Aug 2014) Changed paths: M po/tr.mo M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (1622 of 1622) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
Re: Regularization parameters
Hi, That is interesting. Would you please share some code on how you are setting the regularization type and regularization parameters, and running Logistic Regression? Thanks, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Wednesday, August 6, 2014 6:18:43 PM Subject: Regularization parameters Hi, I tried different regularization parameter values with Logistic Regression for binary classification of my dataset and would like to understand the following results:
regType = L2, regParam = 0.0: I am getting AUC = 0.80 and accuracy of 80%
regType = L1, regParam = 0.0: I am getting AUC = 0.80 and accuracy of 50%
To calculate accuracy I am using 0.5 as the threshold: prediction < 0.5 is class 0, and prediction >= 0.5 is class 1. regParam = 0.0 implies I am not using any regularization, is that correct? If so, it should not matter whether I specify L1 or L2, and I should get the same results. So why is the accuracy value different? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
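For reference, a sketch of how such a run is typically wired up in MLlib. This is illustrative only, not the original poster's code (their exact setup is what the reply above asks for); `trainingData` and `testData` are assumed to be RDD[LabeledPoint]s defined elsewhere:

```
import org.apache.spark.SparkContext._ // double-RDD implicits on older Spark versions
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater}

// regType chooses the Updater, regParam its strength. With regParam = 0.0 the
// penalty term is zero for both updaters, so L1 and L2 should coincide in exact
// arithmetic; a difference usually points at some other part of the setup.
val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.0)
  .setUpdater(new L1Updater()) // or new SquaredL2Updater()

val model = lr.run(trainingData)

// Accuracy with the default 0.5 threshold applied by model.predict.
val accuracy = testData.map { p =>
  if (model.predict(p.features) == p.label) 1.0 else 0.0
}.mean()
```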
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] eab70e: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: eab70ea7b0e1e17034ffb90fe246cc836e76fd97 https://github.com/phpmyadmin/phpmyadmin/commit/eab70ea7b0e1e17034ffb90fe246cc836e76fd97 Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-05 (Tue, 05 Aug 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2974 of 2974) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git
Re: Hello All
Hi Guru, Take a look at: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark It has all the information you need on how to contribute to Spark. Also take a look at: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel where there is a list of issues that need fixing. You can also request or propose new additions to Spark. Happy coding! Burak - Original Message - From: Gurumurthy Yeleswarapu guru...@yahoo.com.INVALID To: dev@spark.apache.org Sent: Tuesday, August 5, 2014 2:43:04 PM Subject: Hello All I'm new to the Spark community. I'm actively working on the Hadoop ecosystem (more specifically YARN). I'm very keen on getting my hands dirty with Spark. Please let me know any pointers to start with. Thanks in advance. Best regards, Guru Yeleswarapu - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[Phpmyadmin-git] [phpmyadmin/phpmyadmin] bc01c1: Translated using Weblate (Turkish)
Branch: refs/heads/master Home: https://github.com/phpmyadmin/phpmyadmin Commit: bc01c12eefc26e03088e30f36fe84cd1e727379c https://github.com/phpmyadmin/phpmyadmin/commit/bc01c12eefc26e03088e30f36fe84cd1e727379c Author: Burak Yavuz hitowerdi...@hotmail.com Date: 2014-08-04 (Mon, 04 Aug 2014) Changed paths: M po/tr.po Log Message: --- Translated using Weblate (Turkish) Currently translated at 100.0% (2962 of 2962) [ci skip] ___ Phpmyadmin-git mailing list Phpmyadmin-git@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/phpmyadmin-git