Strange DAG scheduling behavior on concurrently dependent RDDs
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together based on some predefined business cases. After updating to 1.2.0, some of the concurrency expectations about how the stages within jobs are executed have changed quite significantly. Given 3 RDDs:

RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
RDD2 = RDD1.outputToFile
RDD3 = RDD1.groupBy().outputToFile

In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage encountered (RDD2's outputToFile or RDD3's groupBy()) and then for RDD2 and RDD3 to both block waiting for RDD1 to complete and cache, at which point RDD2 and RDD3 would both use the cached version to complete their work. Spark 1.2.0 seems to schedule two (albeit concurrently running) stages for each of RDD1's stages (inputFromHDFS, groupBy(), sortBy(), and etc() will each get run twice). It does not look like there is any sharing of results between these jobs. Are we doing something wrong? Is there a setting that I'm not understanding somewhere?
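A minimal Scala sketch of the pattern being described, given an existing SparkContext named sc (the input path, key function, and output paths are illustrative placeholders, not the actual framework code):

  import org.apache.spark.SparkContext._

  val rdd1 = sc.textFile("hdfs:///data/input")
    .map(line => (line.split(",")(0), line))   // stand-in for the groupBy().sortBy().etc() chain
    .groupByKey()
    .cache()

  // two downstream jobs that both depend on the cached parent
  rdd1.saveAsTextFile("hdfs:///data/out/raw")
  rdd1.mapValues(_.size).saveAsTextFile("hdfs:///data/out/counts")

The expectation is that whichever downstream job runs first materializes rdd1's partitions, and the second job then reads the cached blocks (its parent stages show up as skipped) rather than recomputing the lineage from HDFS.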
Re: What does (### skipped) mean in the Spark UI?
Sorry- replace ### with an actual number. What does a skipped stage mean? I'm running a series of jobs and it seems like after a certain point, the number of skipped stages is larger than the number of actual completed stages. On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the number of skipped stages couldn't be formatted. Cheers On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
What does (### skipped) mean in the Spark UI?
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
Re: Strange DAG scheduling behavior on concurrently dependent RDDs
I asked this question too soon. I am caching off a bunch of RDDs in a TrieMap so that our framework can wire them together, and the locking was not completely correct; at times it was creating multiple new RDDs instead of reusing the cached versions, which produced completely separate lineages. What's strange is that this bug only surfaced when I updated Spark. On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet cjno...@gmail.com wrote: We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together based on some predefined business cases. After updating to 1.2.0, some of the concurrency expectations about how the stages within jobs are executed have changed quite significantly. Given 3 RDDs:

RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
RDD2 = RDD1.outputToFile
RDD3 = RDD1.groupBy().outputToFile

In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage encountered (RDD2's outputToFile or RDD3's groupBy()) and then for RDD2 and RDD3 to both block waiting for RDD1 to complete and cache, at which point RDD2 and RDD3 would both use the cached version to complete their work. Spark 1.2.0 seems to schedule two (albeit concurrently running) stages for each of RDD1's stages (inputFromHDFS, groupBy(), sortBy(), and etc() will each get run twice). It does not look like there is any sharing of results between these jobs. Are we doing something wrong? Is there a setting that I'm not understanding somewhere?
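A sketch of the kind of locking fix being described, assuming the framework keys RDDs by a name string (the key and value types are illustrative): an atomic putIfAbsent on the TrieMap guarantees concurrent callers always end up sharing a single cached lineage.

  import scala.collection.concurrent.TrieMap
  import org.apache.spark.rdd.RDD

  // shared registry of named, cached RDDs
  val rddCache = TrieMap.empty[String, RDD[(String, String)]]

  def cachedRdd(key: String)(build: => RDD[(String, String)]): RDD[(String, String)] =
    rddCache.get(key) match {
      case Some(existing) => existing
      case None =>
        // building an RDD is lazy and cheap, so losing the putIfAbsent race below is harmless
        val candidate = build.cache()
        rddCache.putIfAbsent(key, candidate).getOrElse(candidate)
    }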
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
On Jan. 5, 2015, 9:09 p.m., Christopher Tubbs wrote: core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java, line 63 https://reviews.apache.org/r/29502/diff/2/?file=804705#file804705line63 Should this be Authorizations.EMPTY? Or should it have a default implementation on WrappingIterator which calls source.getAuthorizations()? Christopher Tubbs wrote: make that `getSource().getAuthorizations()` Specific to this test I returned null because all the other getters (other than what was being explicitly tested) were returning null. Were you thinking WrappingIterator should also provide a getAuthorizations() method? On Jan. 5, 2015, 9:09 p.m., Christopher Tubbs wrote: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java, line 46 https://reviews.apache.org/r/29502/diff/2/?file=804711#file804711line46 I wonder if there's a better way to provide environment options, like this and others, at specific scopes. Maybe use some dependency injection, with annotations, like Servlet @Context or JUnit @Rule: @ScanContext Authorizations auths; (throw error if type is not appropriate for context during injection). This feature would be pretty neat. Were you thinking this would extend past just the IteratorEnvironment into other places? Any other fields you can think of that would benefit from this change other than Authorizations? - Corey --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/#review66725 --- On Dec. 31, 2014, 3:40 p.m., Corey Nolet wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Dec. 31, 2014, 3:40 p.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Bugs: ACCUMULO-3458 https://issues.apache.org/jira/browse/ACCUMULO-3458 Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. 
Diffs - core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 666a8af core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 9726266 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java 2a79f05 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java 15c33fa core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java bf35557 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Jan. 6, 2015, 3:44 p.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Changes --- Fixed based on feedback from Christopher and Keith. Noticed some extra formatting removing whitespace in some places. Bugs: ACCUMULO-3458 https://issues.apache.org/jira/browse/ACCUMULO-3458 Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. Diffs (updated) - core/src/main/java/org/apache/accumulo/core/client/BatchDeleter.java 2bfc347 core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/Scanner.java 112179e core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 666a8af core/src/main/java/org/apache/accumulo/core/client/impl/ScannerIterator.java 1e0ac99 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 9726266 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java 2a79f05 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/main/java/org/apache/accumulo/core/iterators/WrappingIterator.java 060fa76 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java 15c33fa core/src/test/java/org/apache/accumulo/core/client/impl/ScannerImplTest.java be4d467 core/src/test/java/org/apache/accumulo/core/client/impl/TabletServerBatchReaderTest.java PRE-CREATION core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java bf35557 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Jan. 6, 2015, 3:54 p.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Changes --- Removing files which were formatted but not changed in any other way to augment the feature in the commit. Bugs: ACCUMULO-3458 https://issues.apache.org/jira/browse/ACCUMULO-3458 Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. Diffs (updated) - core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 666a8af core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 9726266 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java 2a79f05 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/test/java/org/apache/accumulo/core/client/impl/ScannerImplTest.java be4d467 core/src/test/java/org/apache/accumulo/core/client/impl/TabletServerBatchReaderTest.java PRE-CREATION core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java bf35557 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Re: Write and Read file through map reduce
Hitarth, I don't know how much direction you are looking for with regard to the formats of the files, but you can certainly read both files into the third MapReduce job using the FileInputFormat by comma-separating the paths to the files. The blocks for both files will essentially be unioned together and the mappers scheduled across your cluster. On Mon, Jan 5, 2015 at 3:55 PM, hitarth trivedi t.hita...@gmail.com wrote: Hi, I have a 6 node cluster, and the scenario is as follows :- I have one map reduce job which will write file1 in HDFS. I have another map reduce job which will write file2 in HDFS. In the third map reduce job I need to use file1 and file2 to do some computation and output the value. What is the best way to store file1 and file2 in HDFS so that they could be used in the third map reduce job. Thanks, Hitarth
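For example, roughly (written in Scala against the Hadoop mapreduce API; the paths and job name are made-up placeholders, and the mapper/reducer setup is omitted):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

  val job = Job.getInstance(new Configuration(), "combine-file1-file2")
  // comma-separated paths: splits from both files are unioned and fed to the same mapper class
  FileInputFormat.setInputPaths(job, "hdfs:///jobs/out1,hdfs:///jobs/out2")
  FileOutputFormat.setOutputPath(job, new Path("hdfs:///jobs/out3"))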
Re: Submitting spark jobs through yarn-client
Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be struggling with this same problem. It really took an understanding of how the classpath needs to be configured both in YARN and in the client driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote: So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set in ClientBase.scala when yarn is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's uri from the classpath resources- in this case I think it's finding the spark-yarn jar instead of spark-assembly, so when it tries to run the ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings. On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: ...and looking even further, it looks like the actual command that's executed starting up the JVM to run the org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected but I don't see where to set these properties or why they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit.main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
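A rough sketch of the kind of driver-side configuration involved (the jar paths are illustrative; the linked repository has the actual working example):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("yarn-client")
    .setAppName("spark-jetty-server")
    // assumption: the assembly jar has been uploaded to HDFS so the ExecutorLauncher
    // container can find core classes like org.apache.spark.Logging on its classpath
    .set("spark.yarn.jar", "hdfs:///libs/spark-assembly-1.2.0-hadoop2.4.0.jar")
    // application classes shipped to the executors
    .setJars(Seq("/opt/webapp/lib/my-app.jar"))
  val sc = new SparkContext(conf)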
Re: Submitting spark jobs through yarn-client
Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit.main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Submitting spark jobs through yarn-client
I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit.main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Re: Submitting spark jobs through yarn-client
So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set in ClientBase.scala when yarn is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's uri from the classpath resources- in this case I think it's finding the spark-yarn jar instead of spark-assembly, so when it tries to run the ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings. On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: ...and looking even further, it looks like the actual command that's executed starting up the JVM to run the org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected but I don't see where to set these properties or why they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit.main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Re: Submitting spark jobs through yarn-client
...and looking even further, it looks like the actual command that's executed starting up the JVM to run the org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected but I don't see where to set these properties or why they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a spark context directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf. 2) I'm using SparkSubmit.main() to pass the classname and jar containing my code. When yarn tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master." When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging. Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Dec. 31, 2014, 1:46 p.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. Diffs (updated) - core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 666a8af core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 9726266 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java 2a79f05 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java 15c33fa core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java bf35557 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
On Dec. 31, 2014, 4:30 a.m., Josh Elser wrote: core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java, line 197 https://reviews.apache.org/r/29502/diff/1/?file=804415#file804415line197 Can't you pull this from the Scanner? I didn't see a good way to get this info from the scanner. The more I think about this- a simple getter on the scanner would be massively useful. On Dec. 31, 2014, 4:30 a.m., Josh Elser wrote: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java, line 55 https://reviews.apache.org/r/29502/diff/1/?file=804426#file804426line55 It looks like TabletIteratorEnvironment is used for minor compactions. Isn't always setting `Authorizations.EMPTY` a little misleading? Is there something more representative of having all auths we could do here? Maybe extra documentation is enough? Could also throw UnsupportedOperationException or similar when the IteratorScope is something that isn't SCAN? Good point! This should definitely be documented as a scan-time only operation. I'm on the fence about throwing an exception- I think I could go either way on that. On Dec. 31, 2014, 4:30 a.m., Josh Elser wrote: test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java, line 54 https://reviews.apache.org/r/29502/diff/1/?file=804430#file804430line54 Please create a user, assign it the auths you need, and then remove the user after the test. If this test is run against a standalone instance, it should try to leave the system in the same state the test started in. You know I was thinking about this when I was coding the test and totally forgot to change it before I created the patch. - Corey --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/#review66439 --- On Dec. 31, 2014, 1:46 p.m., Corey Nolet wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Dec. 31, 2014, 1:46 p.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. 
Diffs - core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 666a8af core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 9726266 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java 2a79f05 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java 15c33fa core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java bf35557 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Dec. 31, 2014, 10:40 a.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Bugs: ACCUMULO-3458 https://issues.apache.org/jira/browse/ACCUMULO-3458 Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. Diffs - core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 666a8af core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 9726266 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java 2a79f05 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java 15c33fa core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java bf35557 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29502/ --- (Updated Dec. 31, 2014, 4:05 a.m.) Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and kturner. Changes --- Added accumulo group to review. Repository: accumulo Description --- ACCUMULO-3458 Propagating scan-time authorizations through the IteratorEnvironment so that scan-time iterators can use them. Diffs - core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java 4903656 core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 2552682 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 72cb863 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 9e20cb1 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java 15c33fa core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java 94da7b5 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java fa46360 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java 4521e55 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java 4cebab7 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java 4a45e99 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java a9801b0 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java d1fece5 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 869cc33 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java fe4b16b test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java PRE-CREATION test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet
Submit spark jobs inside web application
I want to have a SparkContext inside of a web application running in Jetty that I can use to submit jobs to a cluster of Spark executors. I am running YARN. Ultimately, I would love it if I could just use something like SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp is deployed and de-allocate them when the webapp goes down. The problem is, SparkSubmit, just like the spark-submit script, requires that I have a JAR that can be deployed to all the nodes right away. Perhaps I'm not understanding enough about how Spark works internally, but I thought it was similar to the Scala shell where closures can be shipped off to executors at runtime. I was wondering if another approach would be to start up an app in yarn-client mode that does nothing and then have my web application connect to that master. Any ideas? Thanks.
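A minimal sketch of the embedded-driver idea, assuming a servlet context listener owns the SparkContext's lifecycle (the class name, app name, and jar path are illustrative):

  import javax.servlet.{ServletContextEvent, ServletContextListener}
  import org.apache.spark.{SparkConf, SparkContext}

  class SparkLifecycleListener extends ServletContextListener {
    @volatile private var sc: SparkContext = _

    override def contextInitialized(e: ServletContextEvent): Unit = {
      val conf = new SparkConf()
        .setMaster("yarn-client")                    // YARN resources are requested when the webapp deploys
        .setAppName("embedded-spark-driver")
        .setJars(Seq("/opt/webapp/lib/my-app.jar"))  // jar containing the closures shipped to executors
      sc = new SparkContext(conf)
      e.getServletContext.setAttribute("sparkContext", sc)
    }

    override def contextDestroyed(e: ServletContextEvent): Unit =
      if (sc != null) sc.stop()                      // releases the YARN containers when the webapp goes down
  }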
How to tell if RDD no longer has any children
Let's say I have an RDD which gets cached and has two children which do something with it:

val rdd1 = ...cache()
rdd1.saveAsSequenceFile()
rdd1.groupBy()...saveAsSequenceFile()

If I were to submit both calls to saveAsSequenceFile() in threads to take advantage of concurrency (where possible), what's the best way to determine when rdd1 is no longer being used by anything? I'm hoping the best way is not to do reference counting in the futures that are running the saveAsSequenceFile().
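One way to avoid explicit reference counting, assuming the two saves really are the only consumers (the pair type and paths below are illustrative, and sc is an existing SparkContext): wait on both futures and unpersist once they complete.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration
  import org.apache.spark.SparkContext._

  val rdd1 = sc.textFile("hdfs:///in").map(line => (line, 1L)).cache()

  // both downstream jobs run concurrently in their own threads
  val save1 = Future { rdd1.saveAsSequenceFile("hdfs:///out/plain") }
  val save2 = Future { rdd1.groupByKey().mapValues(_.sum).saveAsSequenceFile("hdfs:///out/grouped") }

  // once both actions finish, nothing reads rdd1 anymore, so it is safe to evict
  Await.result(Future.sequence(Seq(save1, save2)), Duration.Inf)
  rdd1.unpersist()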
Cached RDD
If I have 2 RDDs which depend on the same RDD like the following:

val rdd1 = ...
val rdd2 = rdd1.groupBy()...
val rdd3 = rdd1.groupBy()...

If I don't cache rdd1, will its lineage be calculated twice (once for rdd2 and once for rdd3)?
Re: When will spark 1.2 released?
The dates of the jars were still of Dec 10th. I figured that was because the jars were staged in Nexus on that date (before the vote). On Fri, Dec 19, 2014 at 12:16 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at: http://search.maven.org/#browse%7C717101892 The dates of the jars were still of Dec 10th. Was I looking at the wrong place? Cheers On Thu, Dec 18, 2014 at 11:10 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, as he posted before, An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight. On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote: Patrick is working on the release as we speak -- I expect it'll be out later tonight (US west coast) or tomorrow at the latest. On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com wrote: Interesting, the maven artifacts were dated Dec 10th. However the vote for RC2 closed recently: http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+ Cheers On Dec 18, 2014, at 10:02 PM, madhu phatak phatak@gmail.com wrote: It's on Maven Central already http://search.maven.org/#browse%7C717101892 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com vboylin1...@gmail.com wrote: Hi, Does anyone know when spark 1.2 will be released? 1.2 has many great features that we can't wait for ,-) Sincerely, Lin wukang Sent from NetEase Mail Master -- Regards, Madhukara Phatak http://www.madhukaraphatak.com
Re: JIRA Tickets for 1.6.2 Release
Have you started tracking a CHANGES list yet (do we need to update anything added back in 1.6.2)? I did start a CHANGES file in the 1.6.2-SNAPSHOT branch. I figure after the tickets settle down I'll just create a new one. On Thu, Dec 18, 2014 at 2:05 PM, Christopher ctubb...@apache.org wrote: I triage'd some of the issues, deferring to 1.7 if they were marked with a fixVersion of 1.5.x or 1.6.x. I left documentation issues alone, as well as tests-related improvements and tasks. I commented on a few which looked like they were general internal improvements that weren't necessarily bugs. Feel free to change them to bugs if I make an incorrect choice on those. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Dec 18, 2014 at 1:06 PM, Josh Elser josh.el...@gmail.com wrote: Thanks for starting this up, Corey. Have you started tracking a CHANGES list yet (do we need to update anything added back in 1.6.2)? Oof, good point re semver. Let's coordinate on triaging the tickets as there are quite a few. On IRC? I don't want multiple people to spend time looking at the same issues :) Christopher wrote: Because we've agreed on Semver for release versioning, all the JIRAs marked for 1.6.x as something other than Bug (or maybe Task, and Test) should probably have 1.6.x dropped from their fixVersion. They can/should get addressed in 1.7 and later. Those currently marked for 1.6.x need to be triage'd to determine if they've been labeled correctly, though. It's not that we can't improve internals in a patch release with Semver (so long as we don't alter the API)... but Semver helps focus changes to patch releases on things that fix buggy behavior. I'll do some triage later today (after some sleep) if others haven't gotten to it first. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Dec 18, 2014 at 12:44 AM, Corey Noletcjno...@gmail.com wrote: Since we've been discussing cutting an rc0 for testing before we begin the formal release process. I've moved over all the non-blocker tickets from 1.6.2 to 1.6.3 [1]. Many of the tickets that moved haven't been updated since the 1.6.1 release. If there are tickets you feel are necessary for 1.6.2, feel free to move them back and mark them as a blocker [2]. I'd like to get an rc0 out very soon- possibly in the next couple of days. [1] https://issues.apache.org/jira/issues/?jql=project%20% 3D%20ACCUMULO%20AND%20fixVersion%20%3D%201.6.3 [2] https://issues.apache.org/jira/issues/?jql=project%20% 3D%20Accumulo%20and%20priority%20%3D%20Blocker% 20and%20fixVersion%20%3D%201.6.2%20and%20status%20%3D%20Open
Re: 1.6.2 candidates
I'll cut one tonight On Wed, Dec 17, 2014 at 1:52 PM, Christopher ctubb...@apache.org wrote: I think we could probably put together a non-voting RC0 to start testing with. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Tue, Dec 16, 2014 at 11:28 PM, Eric Newton eric.new...@gmail.com wrote: We are running 1.6.1 w/patches in production already. I would much rather have a 1.6.2 official release. I may have temporary access to a small cluster (3-ish racks) to run some of the long running tests on bare metal. Testing sooner, rather than later is preferable. On Tue, Dec 16, 2014 at 7:18 PM, Corey Nolet cjno...@gmail.com wrote: I have cycles to spin the RCs- I wouldn't mind finishing the updates (per my notes) of the release documentation as well. On Tue, Dec 16, 2014 at 7:11 PM, Christopher ctubb...@apache.org wrote: I think it'd be good to let somebody else exercise the process a bit, but I can make the RCs if nobody else volunteers. My primary concern is that people will have time to test. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Tue, Dec 16, 2014 at 6:37 PM, Josh Elser josh.el...@gmail.com wrote: +1 There are lots of good bug fixes in 1.6.2 already. I can make some time to test, document, etc. Are you volunteering to spin the RCs as well? Christopher wrote: I'm thinking we should look at releasing 1.6.2 in January. I'd say sooner, but I don't know if people will have time to test if we start putting together RCs this week or next. -- Christopher L Tubbs II http://gravatar.com/ctubbsii
JIRA Tickets for 1.6.2 Release
Since we've been discussing cutting an rc0 for testing before we begin the formal release process. I've moved over all the non-blocker tickets from 1.6.2 to 1.6.3 [1]. Many of the tickets that moved haven't been updated since the 1.6.1 release. If there are tickets you feel are necessary for 1.6.2, feel free to move them back and mark them as a blocker [2]. I'd like to get an rc0 out very soon- possibly in the next couple of days. [1] https://issues.apache.org/jira/issues/?jql=project%20%3D%20ACCUMULO%20AND%20fixVersion%20%3D%201.6.3 [2] https://issues.apache.org/jira/issues/?jql=project%20%3D%20Accumulo%20and%20priority%20%3D%20Blocker%20and%20fixVersion%20%3D%201.6.2%20and%20status%20%3D%20Open
build.sh script still being used?
I'm working on updating the "Making a Release" page on our website [1] with more detailed instructions on the steps involved. The "Create the candidate" section references the build.sh script and I'm contemplating just removing it altogether since it seems like, after quick discussions with a few individuals, maven is mostly being called directly. I don't want to remove this, however, if there are others in the community who still feel it is necessary. The commands that are present in the script are going to be well documented on the page already. Do we need to keep the script around? [1] http://accumulo.apache.org/releasing.html
Re: 1.6.2 candidates
I have cycles to spin the RCs- I wouldn't mind finishing the updates (per my notes) of the release documentation as well. On Tue, Dec 16, 2014 at 7:11 PM, Christopher ctubb...@apache.org wrote: I think it'd be good to let somebody else exercise the process a bit, but I can make the RCs if nobody else volunteers. My primary concern is that people will have time to test. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Tue, Dec 16, 2014 at 6:37 PM, Josh Elser josh.el...@gmail.com wrote: +1 There are lots of good bug fixes in 1.6.2 already. I can make some time to test, document, etc. Are you volunteering to spin the RCs as well? Christopher wrote: I'm thinking we should look at releasing 1.6.2 in January. I'd say sooner, but I don't know if people will have time to test if we start putting together RCs this week or next. -- Christopher L Tubbs II http://gravatar.com/ctubbsii
Spark eating exceptions in multi-threaded local mode
I've been running a job in local mode using --master local[*] and I've noticed that, for some reason, exceptions appear to get eaten- as in, I don't see them. If I debug in my IDE, I'll see that an exception was thrown if I step through the code, but if I just run the application, it appears everything completed even though I know a bunch of my jobs did not actually ever get run. The exception is happening in a map stage. Is there a special way this is supposed to be handled? Am I missing a property somewhere that allows these to be bubbled up?
Re: accumulo join order count,sum,avg
A good example of the count/sum/average can be found in our StatsCombiner example [1]. Joins are more complicated- your implementation of joins will really depend on your data set and the expected sizes of each side of the join. You can obviously always resort to joining data together on different tablets using MapReduce or Spark, but you may be able to simulate more real-time joins if your data allows. Ordering is kind of the same here- depending on your data, you could use specialized indexes that take advantage of the Accumulo keys already being sorted. If you can provide some more detail about your data set, we may be able to provide more specific examples on how to accomplish this. [1] https://accumulo.apache.org/1.6/examples/combiner.html On Sun, Dec 14, 2014 at 9:02 PM, panqing...@163.com panqing...@163.com wrote: How can joins, ordering, count, sum, and avg be implemented in Accumulo? -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/accumulo-join-order-count-sum-avg-tp12568.html Sent from the Developers mailing list archive at Nabble.com.
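As a rough illustration of the combiner approach (shown in Scala; the table name, column family, and priority are placeholders, and the core SummingCombiner is used here rather than the example StatsCombiner):

  import java.util.Collections
  import org.apache.accumulo.core.client.IteratorSetting
  import org.apache.accumulo.core.iterators.{Combiner, LongCombiner}
  import org.apache.accumulo.core.iterators.user.SummingCombiner

  // connector is assumed to be an existing org.apache.accumulo.core.client.Connector
  val setting = new IteratorSetting(10, "sum", classOf[SummingCombiner])
  // combine all values under the "stats" column family into a running sum at scan/compaction time
  Combiner.setColumns(setting, Collections.singletonList(new IteratorSetting.Column("stats")))
  LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING)
  connector.tableOperations().attachIterator("metrics", setting)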
Re: accumulo Scanner
You're going to want to use WholeRowIterator.decodeRow(entry.getKey(), entry.getValue()) for that one. You can do:

for (Entry<Key,Value> entry : scanner) {
  for (Entry<Key,Value> actualEntry : WholeRowIterator.decodeRow(entry.getKey(), entry.getValue()).entrySet()) {
    // do something with actualEntry
  }
}

On Thu, Dec 11, 2014 at 10:24 PM, panqing...@163.com panqing...@163.com wrote: I am trying to use the WholeRowIterator to combine the data for the same row key into a single entry. Now the Value contains the ColumnFamily, ColumnQualifier, and value, but how should that Value be parsed?

for (Entry<Key, Value> entry : scanner) {
  log.info("" + entry.getKey() + ", " + entry.getValue());
}

-- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/accumulo-Scanner-tp12506p12552.html Sent from the Developers mailing list archive at Nabble.com.
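For completeness, the same pattern sketched in Scala, including the step of attaching the WholeRowIterator to the scanner (the table name and priority are illustrative, and connector is an existing Connector):

  import scala.collection.JavaConverters._
  import org.apache.accumulo.core.client.IteratorSetting
  import org.apache.accumulo.core.iterators.user.WholeRowIterator
  import org.apache.accumulo.core.security.Authorizations

  val scanner = connector.createScanner("mytable", new Authorizations())
  scanner.addScanIterator(new IteratorSetting(50, "wholeRow", classOf[WholeRowIterator]))

  for (entry <- scanner.asScala) {
    // each entry is one encoded row; decodeRow unpacks it back into the row's individual key/values
    val row = WholeRowIterator.decodeRow(entry.getKey, entry.getValue)
    for (actual <- row.entrySet().asScala) {
      println(s"${actual.getKey} -> ${actual.getValue}")
    }
  }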
Re: Accumulo Working Day
Also talked a little about Christopher's working on a new API design: https://github.com/ctubbsii/accumulo/blob/ACCUMULO-2589/ On Tue, Dec 9, 2014 at 11:56 PM, Josh Elser josh.el...@gmail.com wrote: Just so you don't think I forgot, there wasn't really much to report today. Lots of friendly banter among everyone. The notable discussion was likely Don Miner stopping by and the collective trying to brainstorm suggestions as to who would be a good candidate for a high-profile keynote speaker for Accumulo Summit 2015 :) We also talked a little bit about metrics (with the recent support for Hadoop metrics2 added) which helped bring some other devs up to speed who hadn't looked at what such support really means. Let me know if I forgot anything other attendees. Josh Elser wrote: I'd be happy to. Not too much discussion yet, but if we talk about anything that doesn't end up on JIRA or elsewhere, I'll make sure it gets posted here. - Josh Mike Drob wrote: For those of us who were unable to attend, can we get a summary of what happened? I'd be curious to know if anything particularly novel came out of this collaboration! On Mon, Dec 8, 2014 at 4:06 PM, Jason Pyeronjpye...@pdinc.us wrote: If you are meeting near Ft. Meade I would like to drop off thank you doughnuts. -Jason -Original Message- From: Keith Turner Sent: Wednesday, December 03, 2014 14:00 Christopher, Eric, Josh, Billie, Mike, and I are meeting on Dec 9 to work on Accumulo together for the day in Central MD. If you are interested in joining us, email me directly. We are meeting in a small conf room, so space is limited. Keith -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Principal Consultant 10 West 24th Street #100 - - +1 (443) 269-1555 x333 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is copyright PD Inc, subject to license 20080407P00.
Possible typo in the Hadoop Latest Stable Release Page
I'm looking @ this page: http://hadoop.apache.org/docs/stable/ Is it a typo that Hadoop 2.6.0 is based on 2.4.1? Thanks.
Re: Running two different Spark jobs vs multi-threading RDDs
Reading the documentation a little more closely, I'm using the wrong terminology. I'm using stages to refer to what spark is calling a job. I guess application (more than one spark context) is what I'm asking about On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote: I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached RDD to provide reports I'm trying to figure out how the resources will be scheduled to perform these stages if I were to concurrently run the two RDDs that depend on the first RDD. Would the two RDDs run sequentially? Will they both run @ the same time and be smart about how they are caching? Would this be a time when I'd want to use Tachyon instead and run this as 2 separate physical jobs: one to place the shared data in the RAMDISK and one to run the two dependent RDDs concurrently? Or would it even be best in that case to run 3 completely separate jobs? We're planning on using YARN so there's 2 levels of scheduling going on. We're trying to figure out the best way to utilize the resources so that we are fully saturating the system and making sure there's constantly work being done rather than anything spinning gears waiting on upstream processing to occur (in mapreduce, we'd just submit a ton of jobs and have them wait in line).
Running two different Spark jobs vs multi-threading RDDs
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached RDD to provide reports I'm trying to figure out how the resources will be scheduled to perform these stages if I were to concurrently run the two RDDs that depend on the first RDD. Would the two RDDs run sequentially? Will they both run @ the same time and be smart about how they are caching? Would this be a time when I'd want to use Tachyon instead and run this as 2 separate physical jobs: one to place the shared data in the RAMDISK and one to run the two dependent RDDs concurrently? Or would it even be best in that case to run 3 completely separate jobs? We're planning on using YARN so there's 2 levels of scheduling going on. We're trying to figure out the best way to utilize the resources so that we are fully saturating the system and making sure there's constantly work being done rather than anything spinning gears waiting on upstream processing to occur (in mapreduce, we'd just submit a ton of jobs and have them wait in line).
Re: [VOTE] ACCUMULO-3176
+1 in case it wasn't inferred from my previous comments. As Josh stated, I'm still confused how the veto still holds technical justification- the changes being made aren't removing methods from the public API. On Mon, Dec 1, 2014 at 3:42 PM, Josh Elser josh.el...@gmail.com wrote: I still don't understand what could even be changed to help you retract your veto. A number of people here have made suggestions about altering the changes to the public API WRT to the major version. I think Brian was the most recent, but I recall asking the same question on the original JIRA issue too. Sean Busbey wrote: I'm not sure what questions weren't previously answered in my explanations, could you please restate which ever ones you want clarification on? The vote is closed and only has 2 binding +1s. That means it fails under consensus rules regardless of my veto, so the issue seems moot. On Mon, Dec 1, 2014 at 1:59 PM, Christopherctubb...@apache.org wrote: So, it's been 5 days since last activity here, and there are still some questions/requests for response left unanswered regarding the veto. I'd really like a response to these questions so we can put this issue to rest. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Wed, Nov 26, 2014 at 1:21 PM, Christopherctubb...@apache.org wrote: On Wed, Nov 26, 2014 at 11:57 AM, Sean Busbeybus...@cloudera.com wrote: Responses to a few things below. On Tue, Nov 25, 2014 at 2:56 PM, Brian Lossbfl...@praxiseng.com wrote: Aren’t API-breaking changes allowed in 1.7? If this change is ok for 2.0, then what is the technical reason why it is ok for version 2.0 but vetoed for version 1.7? On Nov 25, 2014, at 3:48 PM, Sean Busbeybus...@cloudera.com wrote: How about if we push this change in the API out to the client reworking in 2.0? Everything will break there anyways so users will already have to deal with the change. As I previously mentioned, API breaking changes are allowed on major revisions. Currently, 1.7 is a major revision (and I have consistently argued for it to remain classified as such). That doesn't mean we shouldn't consider the cost to end users of making said changes. There is no way to know that there won't be a 1.8 or later version after 1.7 and before 2.0. We already have consensus to do a sweeping overhaul of the API for that later release and have had that consensus for quite some time. Since users will already have to deal with that breakage in 2.0 I don't see this improvement as worth making them deal with changes prior to that. So, are you arguing for no more API additions until 2.0? Because, that's what it sounds like. As is, your general objection to the API seems to be independent of this change, but reflective of an overall policy for API additions. Please address why your argument applies to this specific change, and wouldn't to other API additions. Otherwise, this seems to be a case of special pleading. Please address the fact that there is no breakage here, and we can ensure that there won't be any more removal (except in exceptional circumstances) of deprecated APIs until 2.0 to ease changes. (I actually think that would be a very reasonable policy to adopt today.) In addition, I fully expect that 2.0 will be fully compatible with 1.7, and will also not introduce any breakage except removal of things already deprecated in 1.7. 
If we make this change without marking the previous createTable methods as deprecated, this new API addition AND the previous createTable API will still be available in 2.0 (as deprecated), and will not be removed until 3.0. You have also previously argued for more intermediate releases between major releases. Please explain how you see omitting this API addition is compatible with that goal. Please also explain why, if you consider 1.7 to be a major (expected) release, why such an addition would not be appropriate, but would be appropriate for a future major release (2.0). On Tue, Nov 25, 2014 at 4:18 PM, Christopherctubb...@apache.org wrote: On Tue, Nov 25, 2014 at 5:07 PM, Bill Havanki bhava...@clouderagovt.com wrote: In my interpretation of Sean's veto, what he says is bad - using the ASF word here - is not that the change leaves the property update unsolved. It's that it changes the API without completely solving it. The purpose of the change is not explicitly to alter the API, but it does cause that to happen, and it is that aspect that is bad (with the given justification). I just want to clarify my reasoning. That is my current understanding, as well. Additionally, it seems to me that the two things that make it bad is that it A) doesn't achieve an additional purpose (which can be achieved with additional work), and that B) it deprecates existing methods (which can be avoided). Unless there's some other reason that
Re: Can MiniAccumuloCluster reuse directory?
I had a ticket for that a while back and I don't believe it was ever completed. By default, it wants to dump out new config files for everything; having it reuse a config file would mean not re-initializing each time and reusing the same instance id + rfiles. ACCUMULO-1378 was the ticket, and it looks like it's still open. https://issues.apache.org/jira/browse/ACCUMULO-1378?filter=-1 On Sun, Nov 30, 2014 at 7:09 PM, Josh Elser josh.el...@gmail.com wrote: Not sure if it already can. I believe it would be trivial to do so. David Medinets wrote: When starting MiniAccumuloCluster, can I point to a directory used by a previous MiniAccumuloCluster? If not, would it be infeasible to do so?
[jira] [Commented] (ACCUMULO-3371) Allow user to set Zookeeper port in MiniAccumuloCluster
[ https://issues.apache.org/jira/browse/ACCUMULO-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228916#comment-14228916 ] Corey Nolet commented on ACCUMULO-3371: --- David, http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/minicluster/MiniAccumuloConfig.html I remember adding this in the 1.6 line. Is that not what you were looking for? On Sat, Nov 29, 2014 at 10:32 AM, David Medinets (JIRA) j...@apache.org Allow user to set Zookeeper port in MiniAccumuloCluster --- Key: ACCUMULO-3371 URL: https://issues.apache.org/jira/browse/ACCUMULO-3371 Project: Accumulo Issue Type: Improvement Components: mini Affects Versions: 1.6.1 Environment: Ubuntu inside Docker Reporter: David Medinets Priority: Minor I'm experimenting with Docker to use MiniAccumuloCluster inside containers. However, I ran into an issue that the Zookeeper port is randomly assigned and therefore can't be exposed to the host machine running the Docker container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
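For reference, the usage I had in mind looks roughly like this (the directory, password, and port are illustrative; if the setter name differs in your 1.6.x release, check the MiniAccumuloConfig javadoc linked above):

    import java.io.File
    import org.apache.accumulo.minicluster.{MiniAccumuloCluster, MiniAccumuloConfig}

    // Illustrative directory, password, and port; a fixed ZooKeeper port is what
    // lets Docker expose the mini cluster to the host machine.
    val config = new MiniAccumuloConfig(new File("/tmp/mini-accumulo"), "rootPassword")
    config.setZooKeeperPort(21810)

    val cluster = new MiniAccumuloCluster(config)
    cluster.start()
    println(s"instance=${cluster.getInstanceName} zookeepers=${cluster.getZooKeepers}")
    cluster.stop()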
Re: [DISCUSS] Bylaws Change - Majority Approval for Code Changes
Jeremy, The PMC boards in ASF are re On Wed, Nov 26, 2014 at 1:18 PM, Jeremy Kepner kep...@ll.mit.edu wrote: To be effective, most boards need to be small (~5 people) and not involved with day-to-day. Ideally, if someone says let's bring this to the board for a decision the collective response should be no, let's figure out a compromise. On Wed, Nov 26, 2014 at 12:26:09PM -0600, Mike Drob wrote: Jeremey, FWIW I believe that the PMC is supposed to be that board. In our case, it happens to also be the same population as the committers, because it was suggested that the overlap leads to a healthier community overall. On Wed, Nov 26, 2014 at 12:02 PM, Jeremy Kepner kep...@ll.mit.edu wrote: -1 (I vote to keep current consensus approach) An alternative method for resolution would be to setup an elected (or appointed) advisory board of a small number of folks whose job it is to look out for the long-term health and strategy of Accumulo. This board could then be appealed to on the rare occassions when consensus over important long-term issues cannot be achieved. Just the presence of such a board often has the effect encouraging productive compromise amongst participants. On Wed, Nov 26, 2014 at 05:33:40PM +, dlmar...@comcast.net wrote: It was suggested in the ACCUMULO-3176 thread that code changes should be majority approval instead of consensus approval. I'd like to explore this idea as it might keep the voting email threads less verbose and leave the discussion and consensus building to the comments in JIRA. Thoughts?
Re: Unsubscribe
send an email to user-unsubscr...@hadoop.apache.org to unsubscribe. On Wed, Nov 26, 2014 at 3:08 PM, Li Chen ahli1...@gmail.com wrote: Please unsubscribe me, too. Li On Wed, Nov 26, 2014 at 3:03 PM, Sufi Nawaz s...@eaiti.com wrote: Please suggest how to unsubscribe from this list. Thank you, Sufi Nawaz, Application Innovator, e: s...@eaiti.com, w: www.eaiti.com, o: (571) 306-4683, c: (940) 595-1285
Re: [VOTE] ACCUMULO-3176
I could understand the veto if the change actually caused one of the issues mentioned above or the issue that Sean is raising. But it does not. The eventual consistency of property updates was an issue before this change and continues to be an issue. This JIRA did not attempt to address the property update issue. You said this before I could and I couldn't agree more. Everything will break there anyways so users will already have to deal with the change. I didn't see any methods removed from the API but I could be missing something. I just see a new create() method added. On Tue, Nov 25, 2014 at 3:56 PM, Brian Loss bfl...@praxiseng.com wrote: Aren’t API-breaking changes allowed in 1.7? If this change is ok for 2.0, then what is the technical reason why it is ok for version 2.0 but vetoed for version 1.7? On Nov 25, 2014, at 3:48 PM, Sean Busbey bus...@cloudera.com wrote: How about if we push this change in the API out to the client reworking in 2.0? Everything will break there anyways so users will already have to deal with the change. -- Sean
Re: Configuring custom input format
I was wiring up my job in the shell while I was learning Spark/Scala. I'm getting more comfortable with them both now, so I've been mostly testing through Intellij with mock data as inputs. I think the problem lies more on the Hadoop side than Spark, as the Job object seems to check its state and throw an exception when the toString() method is called before the Job has physically been submitted. On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote: The closer I look at the stack trace in the Scala shell, it appears to be the call to toString() that is causing the construction of the Job object to fail. Is there a way to suppress this output, since it appears to be hindering my ability to new up this object? On Wed, Nov 5, 2014 at 5:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. Creating the new RDD works fine, but setting up the configuration via the static methods on input formats that require a Hadoop Job object is proving to be difficult. Trying to new up my own Job object with the SparkContext.hadoopConfiguration is throwing the exception on line 283 of this grepcode: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job Looking in the SparkContext code, I'm seeing that it's newing up Job objects just fine using nothing but the configuration. Using SparkContext.textFile() appears to be working for me. Any ideas? Has anyone else run into this as well? Is it possible to have a method like SparkContext.getJob() or something similar? Thanks.
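In case it helps anyone else who hits this: Matei's suggestion of building the Job inside a function that returns the RDD works around the shell echo entirely. A rough sketch, using TextInputFormat and an invented path as stand-ins for whatever custom format is being configured:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    // Build and configure the Job inside a function: only the resulting RDD is echoed
    // by the shell, so Job.toString() is never invoked while the Job is in the DEFINE state.
    def newRdd(path: String) = {
      val job = Job.getInstance(sc.hadoopConfiguration)
      FileInputFormat.addInputPath(job, new Path(path))
      sc.newAPIHadoopRDD(job.getConfiguration,
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    }

    val rdd = newRdd("hdfs:///illustrative/input")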
[jira] [Commented] (ACCUMULO-1817) Create a monitoring bridge similar to Hadoop's GangliaContext that can allow easy pluggable support
[ https://issues.apache.org/jira/browse/ACCUMULO-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224975#comment-14224975 ] Corey Nolet commented on ACCUMULO-1817: --- Awesome! Given that people have been using the jmx wrapper for ganglia metrics, I didn't know if this was still going to be important for others. I can say it's still something I would like to see. Create a monitoring bridge similar to Hadoop's GangliaContext that can allow easy pluggable support --- Key: ACCUMULO-1817 URL: https://issues.apache.org/jira/browse/ACCUMULO-1817 Project: Accumulo Issue Type: New Feature Reporter: Corey J. Nolet Assignee: Billie Rinaldi Labels: proposed Fix For: 1.7.0 We currently expose JMX and it's possible (with external code) to bridge the JMX to solutions like Ganglia. It would be ideal if the integration were native and pluggable. Turns out that Hadoop (hdfs, mapred) and HBase has direct metrics reporting to Ganglia through some nice code provided in Hadoop. Look into the GangliaContext to see if we can implement Ganglia metrics reporting by Accumulo configuration alone. References: http://wiki.apache.org/hadoop/GangliaMetrics, http://hbase.apache.org/metrics.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Job object toString() is throwing an exception
I was playing around in the Spark shell and newing up an instance of Job that I could use to configure the input format for a job. By default, the Scala shell prints the result of every command typed. It throws an exception when it prints the newly created instance of Job because the Job sets its state upon allocation and is not happy with the state it's in when toString() is called before the job is submitted. I'm using Hadoop 2.5.1. I don't see any tickets for this for 2.6. Has anyone else run into this?
Re: Job object toString() is throwing an exception
Here's the stack trace. I was going to file a ticket for this but wanted to check on the user list first to make sure there wasn't already a fix in the works. It has to do with the Scala shell doing a toString() each time a command is typed in. The stack trace stops the instance of Job from ever being assigned. scala val job = new org.apache.hadoop.mapreduce.Job warning: there were 1 deprecation warning(s); re-run with -deprecation for details java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .init(console:10) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Tue, Nov 25, 2014 at 9:39 PM, Rohith Sharma K S rohithsharm...@huawei.com wrote: Could you give error message or stack trace? *From:* Corey Nolet [mailto:cjno...@gmail.com] *Sent:* 26 November 2014 07:54 *To:* user@hadoop.apache.org *Subject:* Job object toString() is throwing an exception I was playing around in the Spark shell and newing up an instance of Job that I could use to configure the inputformat for a job. 
By default, the Scala shell println's the result of every command typed. It throws an exception when it printlns the newly created instance of Job because it looks like it's setting a state upon allocation and it's not happy with the state that it's in when toString() is called before the job is submitted. I'm using Hadoop 2.5.1. I don't see any tickets for this for 2.6. Has anyone else ran into this?
Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted
I was actually about to post this myself- I have a complex join that could benefit from something like a GroupComparator vs having to do multiple groupBy operations. This is probably the wrong thread for a full discussion on this but I didn't see a JIRA ticket for this or anything similar- any reasons why this would not make sense given Spark's design? On Thu, Nov 20, 2014 at 9:39 AM, Madhu ma...@madhu.com wrote: Thanks Patrick. I've been testing some 1.2 features, looks good so far. I have some example code that I think will be helpful for certain MR-style use cases (secondary sort). Can I still add that to the 1.2 documentation, or is that frozen at this point? - -- Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-2-0-Release-Preview-Posted-tp9400p9449.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
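For context, what I'm after is the classic MapReduce secondary sort. My understanding is that 1.2's repartitionAndSortWithinPartitions can approximate it by partitioning on the grouping half of a composite key while sorting on the whole key; a rough, made-up sketch (key and field names are invented):

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._

    // Composite key: group by id, order by ts within each group (invented names).
    case class EventKey(id: String, ts: Long)

    // Partition only on the grouping portion so all records for an id land together.
    class IdPartitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case EventKey(id, _) => ((id.hashCode % numPartitions) + numPartitions) % numPartitions
      }
    }

    // Ordering over the full key provides the within-partition (secondary) sort.
    implicit val eventKeyOrdering: Ordering[EventKey] = Ordering.by(k => (k.id, k.ts))

    val events = sc.parallelize(Seq(
      (EventKey("a", 2L), "second"), (EventKey("a", 1L), "first"), (EventKey("b", 5L), "only")))

    val grouped = events.repartitionAndSortWithinPartitions(new IdPartitioner(4))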
Re: unsubscribe
Abdul, Please send an email to user-unsubscr...@spark.apache.org On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote: - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Contribute Examples/Exercises
Mike David, Are you +1 for contributing the examples or +1 for moving the examples out into separate repos? On Fri, Nov 14, 2014 at 12:52 PM, David Medinets david.medin...@gmail.com wrote: +1 On Nov 14, 2014 11:18 AM, Keith Turner ke...@deenlo.com wrote: On Wed, Nov 12, 2014 at 4:52 PM, Corey Nolet cjno...@gmail.com wrote: Josh, My worry with a contrib module is that, historically, code which goes moves to a contrib is just one step away from the grave. You do have a good point. My hope was that this could be the beginning of our changing history so that we could begin to encourage the community to contribute their own source directly and give them an outlet for doing so. I understand that's also the intent of hosting open source repos under ASF to begin with- so I'm partial to either outcome. I think there's precedence for keeping them in core (as Christopher had mentioned, next to examples/simple) which would benefit people externally (more how do I do X examples) and internally (keep devs honest about how our APIs are implemented). I would think that would just require keeping the repos up to date as versions change so they wouldn't get out of date and possibly releasing them w/ our other releases. Wherever they end up living, thank you Adam for the contributions! I'll 2nd that. For the following reasons, I think it might be nice to move existing examples out of core into their own git repo(s). * Examples would be based on released version of Accumulo * Examples could easily be built w/o building all of Accumulo * As Sean said, this would keep us honest * The examples poms would serve as examples more than they do when part of Accumulo build * Less likely to use non public APIs in examples On Wed, Nov 12, 2014 at 2:54 PM, Josh Elser josh.el...@gmail.com wrote: My worry with a contrib module is that, historically, code which goes moves to a contrib is just one step away from the grave. I think there's precedence for keeping them in core (as Christopher had mentioned, next to examples/simple) which would benefit people externally (more how do I do X examples) and internally (keep devs honest about how our APIs are implemented). Bringing the examples into the core also encourages us to grow the community which has been stagnant with respect to new committers for about 9 months now. Corey Nolet wrote: +1 for adding the examples to contrib. I was, myself, reading over this email wondering how a set of 11 separate examples on the use of Accumulo would fit into the core codebase- especially as more are contributed over tinme. I like the idea of giving community members an outlet for contributing examples that they've built so that we can continue to foster that without having to fit them in the core codebase. It just seems more maintainable. On Wed, Nov 12, 2014 at 2:19 PM, Josh Elserjosh.el...@gmail.com wrote: I'll take that as you disagree with my consideration of substantial. Thanks. Mike Drob wrote: The proposed contribution is a collection of 11 examples. It's clearly non-trivial, which is probably enough to be considered substantial On Wed, Nov 12, 2014 at 12:58 PM, Josh Elserjosh.el...@gmail.com wrote: Sean Busbey wrote: On Wed, Nov 12, 2014 at 12:31 PM, Josh Elser josh.el...@gmail.com wrote: Personally, I didn't really think that this contribution was in the spirit of what the new codebase adoption guidelines were meant to cover. Some extra examples which leverage what Accumulo already does seems more like improvements for new Accumulo users than anything else. 
It's content developed outside of the project list. That's all it takes to require the trip through the Incubator checks as far as the ASF guidelines are concerned. From http://incubator.apache.org/ip-clearance/index.html: From time to time, an external codebase is brought into the ASF that is not a separate incubating project but still represents a substantial contribution that was not developed within the ASF's source control system and on our public mailing lists. Not to look a gift-horse in the mouth (it is great work), but I don't see these examples as substantial. I haven't found guidelines yet that better clarify the definition of substantial.
Spark Hadoop 2.5.1
I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is the current stable Hadoop 2.x, would it make sense for us to update the poms?
Re: Spark Hadoop 2.5.1
In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like you've mentioned. What prompted me to write this email was that I did not see any documentation that told me Hadoop 2.5.1 was officially supported by Spark (i.e. community has been using it, any bugs are being fixed, etc...). It builds, tests pass, etc... but there could be other implications that I have not run into based on my own use of the framework. If we are saying that the standard procedure is to build with the hadoop-2.4 profile and override the -Dhadoop.version property, should we provide that on the build instructions [1] at least? [1] http://spark.apache.org/docs/latest/building-with-maven.html On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote: I don't think it's necessary. You're looking at the hadoop-2.4 profile, which works with anything = 2.4. AFAIK there is no further specialization needed beyond that. The profile sets hadoop.version to 2.4.0 by default, but this can be overridden. On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet cjno...@gmail.com wrote: I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is the current stable Hadoop 2.x, would it make sense for us to update the poms?
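For reference, the invocation I've been using is along these lines (substitute whichever Hadoop version you're targeting):

    mvn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package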
Re: Contribute Examples/Exercises
+1 for adding the examples to contrib. I was, myself, reading over this email wondering how a set of 11 separate examples on the use of Accumulo would fit into the core codebase- especially as more are contributed over time. I like the idea of giving community members an outlet for contributing examples that they've built so that we can continue to foster that without having to fit them in the core codebase. It just seems more maintainable. On Wed, Nov 12, 2014 at 2:19 PM, Josh Elser josh.el...@gmail.com wrote: I'll take that as you disagree with my consideration of substantial. Thanks. Mike Drob wrote: The proposed contribution is a collection of 11 examples. It's clearly non-trivial, which is probably enough to be considered substantial On Wed, Nov 12, 2014 at 12:58 PM, Josh Elser josh.el...@gmail.com wrote: Sean Busbey wrote: On Wed, Nov 12, 2014 at 12:31 PM, Josh Elser josh.el...@gmail.com wrote: Personally, I didn't really think that this contribution was in the spirit of what the new codebase adoption guidelines were meant to cover. Some extra examples which leverage what Accumulo already does seems more like improvements for new Accumulo users than anything else. It's content developed outside of the project list. That's all it takes to require the trip through the Incubator checks as far as the ASF guidelines are concerned. From http://incubator.apache.org/ip-clearance/index.html From time to time, an external codebase is brought into the ASF that is not a separate incubating project but still represents a substantial contribution that was not developed within the ASF's source control system and on our public mailing lists. Not to look a gift-horse in the mouth (it is great work), but I don't see these examples as substantial. I haven't found guidelines yet that better clarify the definition of substantial.
Re: Contribute Examples/Exercises
Josh, My worry with a contrib module is that, historically, code which goes moves to a contrib is just one step away from the grave. You do have a good point. My hope was that this could be the beginning of our changing history so that we could begin to encourage the community to contribute their own source directly and give them an outlet for doing so. I understand that's also the intent of hosting open source repos under ASF to begin with- so I'm partial to either outcome. I think there's precedence for keeping them in core (as Christopher had mentioned, next to examples/simple) which would benefit people externally (more how do I do X examples) and internally (keep devs honest about how our APIs are implemented). I would think that would just require keeping the repos up to date as versions change so they wouldn't get out of date and possibly releasing them w/ our other releases. Wherever they end up living, thank you Adam for the contributions! On Wed, Nov 12, 2014 at 2:54 PM, Josh Elser josh.el...@gmail.com wrote: My worry with a contrib module is that, historically, code which goes moves to a contrib is just one step away from the grave. I think there's precedence for keeping them in core (as Christopher had mentioned, next to examples/simple) which would benefit people externally (more how do I do X examples) and internally (keep devs honest about how our APIs are implemented). Bringing the examples into the core also encourages us to grow the community which has been stagnant with respect to new committers for about 9 months now. Corey Nolet wrote: +1 for adding the examples to contrib. I was, myself, reading over this email wondering how a set of 11 separate examples on the use of Accumulo would fit into the core codebase- especially as more are contributed over tinme. I like the idea of giving community members an outlet for contributing examples that they've built so that we can continue to foster that without having to fit them in the core codebase. It just seems more maintainable. On Wed, Nov 12, 2014 at 2:19 PM, Josh Elserjosh.el...@gmail.com wrote: I'll take that as you disagree with my consideration of substantial. Thanks. Mike Drob wrote: The proposed contribution is a collection of 11 examples. It's clearly non-trivial, which is probably enough to be considered substantial On Wed, Nov 12, 2014 at 12:58 PM, Josh Elserjosh.el...@gmail.com wrote: Sean Busbey wrote: On Wed, Nov 12, 2014 at 12:31 PM, Josh Elserjosh.el...@gmail.com wrote: Personally, I didn't really think that this contribution was in the spirit of what the new codebase adoption guidelines were meant to cover. Some extra examples which leverage what Accumulo already does seems more like improvements for new Accumulo users than anything else. It's content developed out side of the project list. That's all it takes to require the trip through the Incubator checks as far as the ASF guidelines are concerned. From http://incubator.apache.org/ip-clearance/index.html From time to time, an external codebase is brought into the ASF that is not a separate incubating project but still represents a substantial contribution that was not developed within the ASF's source control system and on our public mailing lists. Not to look a gift-horse in the mouth (it is great work), but I don't see these examples as substantial. I haven't found guidelines yet that better clarify the definition of substantial.
Spark SQL Lazy Schema Evaluation
I'm loading sequence files containing JSON blobs in the value, transforming them into RDD[String], and then using hiveContext.jsonRDD(). It looks like Spark reads the files twice: once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql(). Looking at the code, I see an inferSchema() method which gets called under the hood. I also see an experimental jsonRDD() method which takes a sampling ratio. My dataset is extremely large and I've got a lot of processing to do on it, so looping through it twice is not a luxury I have. I also know that the SQL I am going to be running matches at least some of the records contained in the files. Would it make sense, or be possible with the current execution plan design, to bypass inferring the schema for purposes of speed? Though I haven't really dug further into the code than the implementations of the client API methods that I'm calling, I am wondering if there's a way to process the data without pre-determining the schema. I also don't have the luxury of giving the full schema ahead of time, because I may want to do a select * from the table while only knowing 2 or 3 of the actual JSON keys that are available. Thanks.
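For anyone who hits the same thing, the two partial workarounds I'm aware of are sketched below against my reading of the 1.1/1.2 API (field names and the path are placeholders, and hiveContext is the usual HiveContext), so please correct me if I'm misreading it: pass a sampling ratio so inference only scans a fraction of the records, or hand jsonRDD() a StructType up front so the inference pass is skipped entirely.

    import org.apache.spark.sql._

    val jsonStrings = sc.textFile("hdfs:///illustrative/json")   // stand-in for the RDD[String] of blobs

    // 1) Sample a fraction of the records when inferring the schema.
    val sampled = hiveContext.jsonRDD(jsonStrings, 0.01)

    // 2) Supply a schema up front (just the fields actually queried) and skip inference.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val withSchema = hiveContext.jsonRDD(jsonStrings, schema)
    withSchema.registerTempTable("records")

Neither option fully covers the select * case described above, since the explicit schema only exposes the fields it names.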
Re: [VOTE] Designating maintainers for some Spark components
+1 (non-binding) [for original process proposal] Greg, the first time I've seen the word ownership on this thread is in your message. The first time the word lead has appeared in this thread is in your message as well. I don't think that was the intent. The PMC and Committers have a responsibility to the community to make sure that their patches are being reviewed and committed. I don't see anywhere in Apache's recommended bylaws that says establishing responsibility on paper for specific areas cannot be taken on by different members of the PMC. What's been proposed looks, to me, to be an empirical process and it looks like it has pretty much a consensus from the side able to give binding votes. I don't think this model establishes any form of ownership over anything. I also don't see where the process proposal mentions that nobody other than the persons responsible for a module can review or commit code. In fact, I'll go as far as to say that since Apache is a meritocracy, the people who have been aligned to the responsibilities probably were aligned based on some sort of merit, correct? Perhaps we could dig in and find out for sure... I'm still getting familiar with the Spark community myself. On Thu, Nov 6, 2014 at 7:28 PM, Patrick Wendell pwend...@gmail.com wrote: In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and if/when absolutely necessary, to vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite.
Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two
Re: [VOTE] Designating maintainers for some Spark components
PMC [1] is responsible for oversight and does not designate partial or full committer. There are projects where all committers become PMC and others where PMC is reserved for committers with the most merit (and willingness to take on the responsibility of project oversight, releases, etc...). Community maintains the codebase through committers. Committers to mentor, roll in patches, and spread the project throughout other communities. Adding someone's name to a list as a maintainer is not a barrier. With a community as large as Spark's, and myself not being a committer on this project, I see it as a welcome opportunity to find a mentor in the areas in which I'm interested in contributing. We'd expect the list of names to grow as more volunteers gain more interest, correct? To me, that seems quite contrary to a barrier. [1] http://www.apache.org/dev/pmc.html On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: So I don't understand, Greg, are the partial committers committers, or are they not? Spark also has a PMC, but our PMC currently consists of all committers (we decided not to have a differentiation when we left the incubator). I see the Subversion partial committers listed as committers on https://people.apache.org/committers-by-project.html#subversion, so I assume they are committers. As far as I can see, CloudStack is similar. Matei On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote: Partial committers are people invited to work on a particular area, and they do not require sign-off to work on that area. They can get a sign-off and commit outside that area. That approach doesn't compare to this proposal. Full committers are PMC members. As each PMC member is responsible for *every* line of code, then every PMC member should have complete rights to every line of code. Creating disparity flies in the face of a PMC member's responsibility. If I am a Spark PMC member, then I have responsibility for GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And interposing a barrier inhibits my responsibility to ensure GraphX is designed, maintained, and delivered to the Public. Cheers, -g (and yes, I'm aware of COMMITTERS; I've been changing that file for the past 12 years :-) ) On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com mailto:pwend...@gmail.com wrote: In fact, if you look at the subversion commiter list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com mailto:pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com mailto: gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned up. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase until their purview. It should not be subjected to others' ownership except throught the standard mechanisms of reviews and if/when absolutely necessary, to vetos. 
Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us, can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a
Re: [VOTE] Designating maintainers for some Spark components
I'm actually going to change my non-binding to +0 for the proposal as-is. I overlooked some parts of the original proposal that, when reading over them again, do not sit well with me. one of the maintainers needs to sign off on each patch to the component, as Greg has pointed out, does seem to imply that there are committers with more power than others with regards to specific components- which does imply ownership. My thinking would be to re-work in some way as to take out the accent on ownership. I would maybe focus on things such as: 1) Other committers and contributors being forced to consult with maintainers of modules before patches can get rolled in. 2) Maintainers being assigned specifically from PMC. 3) Oversight to have more accent on keeping the community happy in a specific area of interest vice being a consultant for the design of a specific piece. On Thu, Nov 6, 2014 at 8:46 PM, Arun C Murthy a...@hortonworks.com wrote: With my ASF Member hat on, I fully agree with Greg. As he points out, this is an anti-pattern in the ASF and is severely frowned upon. We, in Hadoop, had a similar trajectory where we had were politely told to go away from having sub-project committers (HDFS, MapReduce etc.) to a common list of committers. There were some concerns initially, but we have successfully managed to work together and build a more healthy community as a result of following the advice on the ASF Way. I do have sympathy for good oversight etc. as the project grows and attracts many contributors - it's essentially the need to have smaller, well-knit developer communities. One way to achieve that would be to have separate TLPs (e.g. Spark, MLLIB, GraphX) with separate committer lists for each representing the appropriate community. Hadoop went a similar route where we had Pig, Hive, HBase etc. as sub-projects initially and then split them into TLPs with more focussed communities to the benefit of everyone. Maybe you guys want to try this too? Few more observations: # In general, *discussions* on project directions (such as new concept of *maintainers*) should happen first on the public lists *before* voting, not in the private PMC list. # If you chose to go this route in spite of this advice, seems to me Spark would be better of having more maintainers per component (at least 4-5), probably with a lot more diversity in terms of affiliations. Not sure if that is a concern - do you have good diversity in the proposed list? This will ensure that there are no concerns about a dominant employer controlling a project. Hope this helps - we've gone through similar journey, got through similar issues and fully embraced the Apache Way (™) as Greg points out to our benefit. thanks, Arun On Nov 6, 2014, at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned up. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase until their purview. It should not be subjected to others' ownership except throught the standard mechanisms of reviews and if/when absolutely necessary, to vetos. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. 
Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us, can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed
Re: Selecting Based on Nested Values using Language Integrated Query Syntax
Michael, Thanks for the explanation. I was able to get this running. On Wed, Oct 29, 2014 at 3:07 PM, Michael Armbrust mich...@databricks.com wrote: We are working on more helpful error messages, but in the meantime let me explain how to read this output. org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'p.name,'p.age, tree: Project ['p.name,'p.age] Filter ('location.number = 2300) Join Inner, Some((location#110.number AS number#111 = 'ln.streetnumber)) Generate explode(locations#10), true, false, Some(l) LowerCaseSchema Subquery p Subquery people SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) LowerCaseSchema Subquery ln Subquery locationNames SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) 'tickedFields indicate a failure to resolve, where as numbered#10 attributes have been resolved. (The numbers are globally unique and can be used to disambiguate where a column is coming from when the names are the same) Resolution happens bottom up. So the first place that there is a problem is 'ln.streetnumber, which prevents the rest of the query from resolving. If you look at the subquery ln, it is only producing two columns: locationName and locationNumber. So streetnumber is not valid. On Tue, Oct 28, 2014 at 8:02 PM, Corey Nolet cjno...@gmail.com wrote: scala locations.queryExecution warning: there were 1 feature warning(s); re-run with -feature for details res28: _4.sqlContext.QueryExecution forSome { val _4: org.apache.spark.sql.SchemaRDD } = == Parsed Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) == Analyzed Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) == Optimized Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) == Physical Plan == ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38 Code Generation: false == RDD == scala people.queryExecution warning: there were 1 feature warning(s); re-run with -feature for details res29: _5.sqlContext.QueryExecution forSome { val _5: org.apache.spark.sql.SchemaRDD } = == Parsed Logical Plan == SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) == Analyzed Logical Plan == SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) == Optimized Logical Plan == SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) == Physical Plan == ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38 Code Generation: false == RDD == Here's when I try executing the join and the lateral view explode() : 14/10/28 23:05:35 INFO ParseDriver: Parse Completed org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'p.name,'p.age, tree: Project ['p.name,'p.age] Filter ('location.number = 2300) Join Inner, Some((location#110.number AS number#111 = 'ln.streetnumber)) Generate explode(locations#10), true, false, Some(l) LowerCaseSchema Subquery p Subquery people SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) LowerCaseSchema Subquery ln Subquery locationNames SparkLogicalPlan (ExistingRdd 
[locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply
Configuring custom input format
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. Creating the new RDD works fine, but setting up the configuration via the static methods on input formats that require a Hadoop Job object is proving to be difficult. Trying to new up my own Job object with the SparkContext.hadoopConfiguration is throwing the exception on line 283 of this grepcode: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job Looking in the SparkContext code, I'm seeing that it's newing up Job objects just fine using nothing but the configuration. Using SparkContext.textFile() appears to be working for me. Any ideas? Has anyone else run into this as well? Is it possible to have a method like SparkContext.getJob() or something similar? Thanks.
Re: Configuring custom input format
The closer I look at the stack trace in the Scala shell, it appears to be the call to toString() that is causing the construction of the Job object to fail. Is there a way to suppress this output, since it appears to be hindering my ability to new up this object? On Wed, Nov 5, 2014 at 5:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. Creating the new RDD works fine, but setting up the configuration via the static methods on input formats that require a Hadoop Job object is proving to be difficult. Trying to new up my own Job object with the SparkContext.hadoopConfiguration is throwing the exception on line 283 of this grepcode: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job Looking in the SparkContext code, I'm seeing that it's newing up Job objects just fine using nothing but the configuration. Using SparkContext.textFile() appears to be working for me. Any ideas? Has anyone else run into this as well? Is it possible to have a method like SparkContext.getJob() or something similar? Thanks.
Re: Spark SQL takes unexpected time
Michael, I should probably look closer myself @ the design of 1.2 vs 1.1 but I've been curious why Spark's in-memory data uses the heap instead of putting it off heap? Was this the optimization that was done in 1.2 to alleviate GC? On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari sbir...@wynyardgroup.com wrote: Yes, I am using Spark1.1.0 and have used rdd.registerTempTable(). I tried by adding sqlContext.cacheTable(), but it took 59 seconds (more than earlier). I also tried by changing schema to use Long data type in some fields but seems conversion takes more time. Is there any way to specify index ? Though I checked and didn't found any, just want to confirm. For your reference here is the snippet of code. - case class EventDataTbl(EventUID: Long, ONum: Long, RNum: Long, Timestamp: java.sql.Timestamp, Duration: String, Type: String, Source: String, OName: String, RName: String) val format = new java.text.SimpleDateFormat(-MM-dd hh:mm:ss) val cedFileName = hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2 val cedRdd = sc.textFile(cedFileName).map(_.split(,, -1)).map(p = EventDataTbl(p(0).toLong, p(1).toLong, p(2).toLong, new java.sql.Timestamp(format.parse(p(3)).getTime()), p(4), p(5), p(6), p(7), p(8))) cedRdd.registerTempTable(EventDataTbl) sqlCntxt.cacheTable(EventDataTbl) val t1 = System.nanoTime() println(\n\n10 Most frequent conversations between the Originators and Recipients\n) sql(SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT 10).collect().foreach(println) val t2 = System.nanoTime() println(Time taken + (t2-t1)/10.0 + Seconds) - Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925p18017.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Why mapred for the HadoopRDD?
I'm fairly new to spark and I'm trying to kick the tires with a few InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf instead of a MapReduce Job object. Is there future planned support for the mapreduce packaging?
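As a side note for anyone searching later: my understanding is that input formats from the newer org.apache.hadoop.mapreduce package are already reachable through SparkContext's newAPIHadoop* methods; something roughly like this (path is illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // newAPIHadoopFile accepts input formats from the newer mapreduce package directly.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///illustrative/input",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])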
Re: unsubscribe
Hongbin, Please send an email to user-unsubscr...@spark.apache.org in order to unsubscribe from the user list. On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote: Apology for having to send to all. I am highly interested in spark, would like to stay in this mailing list. But the email I signed up is not right one. The link below to unsubscribe seems not working. https://spark.apache.org/community.html Can anyone help? -- This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of Intercontinental Exchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.
Re: Selecting Based on Nested Values using Language Integrated Query Syntax
So it wouldn't be possible to have a json string like this: { name:John, age:53, locations: [{ street:Rodeo Dr, number:2300 }]} And query all people who have a location with number = 2300? On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote: Is it possible to select if, say, there was an addresses field that had a json array? You can get the Nth item by address.getItem(0). If you want to walk through the whole array look at LATERAL VIEW EXPLODE in HiveQL
Re: Selecting Based on Nested Values using Language Integrated Query Syntax
Michael, Awesome, this is what I was looking for. So it's possible to use hive dialect in a regular sql context? This is what was confusing to me- the docs kind of allude to it but don't directly point it out. On Tue, Oct 28, 2014 at 9:30 PM, Michael Armbrust mich...@databricks.com wrote: You can do this: $ sbt/sbt hive/console scala jsonRDD(sparkContext.parallelize({ name:John, age:53, locations: [{ street:Rodeo Dr, number:2300 }]} :: Nil)).registerTempTable(people) scala sql(SELECT name FROM people LATERAL VIEW explode(locations) l AS location WHERE location.number = 2300).collect() res0: Array[org.apache.spark.sql.Row] = Array([John]) This will double show people who have more than one matching address. On Tue, Oct 28, 2014 at 5:52 PM, Corey Nolet cjno...@gmail.com wrote: So it wouldn't be possible to have a json string like this: { name:John, age:53, locations: [{ street:Rodeo Dr, number:2300 }]} And query all people who have a location with number = 2300? On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote: Is it possible to select if, say, there was an addresses field that had a json array? You can get the Nth item by address.getItem(0). If you want to walk through the whole array look at LATERAL VIEW EXPLODE in HiveQL
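For anyone following along: my understanding is that outside of the sbt hive/console, the HiveQL dialect (LATERAL VIEW, explode, etc.) comes from using a HiveContext rather than the plain SQLContext, assuming spark-hive is on the classpath. A rough sketch reusing Michael's example data:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Michael's example data and query, run through a HiveContext so HiveQL
    // constructs like LATERAL VIEW / explode are available.
    val people = hiveContext.jsonRDD(sc.parallelize(
      """{ "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}""" :: Nil))
    people.registerTempTable("people")

    hiveContext.sql(
      "SELECT name FROM people LATERAL VIEW explode(locations) l AS location " +
      "WHERE location.number = 2300").collect()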
Re: Selecting Based on Nested Values using Language Integrated Query Syntax
Am I able to do a join on an exploded field? Like if I have another object: { streetNumber:2300, locationName:The Big Building} and I want to join with the previous json by the locations[].number field- is that possible? On Tue, Oct 28, 2014 at 9:31 PM, Corey Nolet cjno...@gmail.com wrote: Michael, Awesome, this is what I was looking for. So it's possible to use hive dialect in a regular sql context? This is what was confusing to me- the docs kind of allude to it but don't directly point it out. On Tue, Oct 28, 2014 at 9:30 PM, Michael Armbrust mich...@databricks.com wrote: You can do this: $ sbt/sbt hive/console scala jsonRDD(sparkContext.parallelize({ name:John, age:53, locations: [{ street:Rodeo Dr, number:2300 }]} :: Nil)).registerTempTable(people) scala sql(SELECT name FROM people LATERAL VIEW explode(locations) l AS location WHERE location.number = 2300).collect() res0: Array[org.apache.spark.sql.Row] = Array([John]) This will double show people who have more than one matching address. On Tue, Oct 28, 2014 at 5:52 PM, Corey Nolet cjno...@gmail.com wrote: So it wouldn't be possible to have a json string like this: { name:John, age:53, locations: [{ street:Rodeo Dr, number:2300 }]} And query all people who have a location with number = 2300? On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote: Is it possible to select if, say, there was an addresses field that had a json array? You can get the Nth item by address.getItem(0). If you want to walk through the whole array look at LATERAL VIEW EXPLODE in HiveQL
Re: Selecting Based on Nested Values using Language Integrated Query Syntax
scala locations.queryExecution warning: there were 1 feature warning(s); re-run with -feature for details res28: _4.sqlContext.QueryExecution forSome { val _4: org.apache.spark.sql.SchemaRDD } = == Parsed Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) == Analyzed Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) == Optimized Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) == Physical Plan == ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38 Code Generation: false == RDD == scala people.queryExecution warning: there were 1 feature warning(s); re-run with -feature for details res29: _5.sqlContext.QueryExecution forSome { val _5: org.apache.spark.sql.SchemaRDD } = == Parsed Logical Plan == SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) == Analyzed Logical Plan == SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) == Optimized Logical Plan == SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) == Physical Plan == ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38 Code Generation: false == RDD == Here's when I try executing the join and the lateral view explode() : 14/10/28 23:05:35 INFO ParseDriver: Parse Completed org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'p.name,'p.age, tree: Project ['p.name,'p.age] Filter ('location.number = 2300) Join Inner, Some((location#110.number AS number#111 = 'ln.streetnumber)) Generate explode(locations#10), true, false, Some(l) LowerCaseSchema Subquery p Subquery people SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at JsonRDD.scala:38) LowerCaseSchema Subquery ln Subquery locationNames SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at JsonRDD.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397) at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358) at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) On Tue, Oct 28, 2014 at 10:48 PM, Michael Armbrust mich...@databricks.com wrote: Can you println the .queryExecution of the SchemaRDD? On Tue, Oct 28, 2014 at 7:43 PM, Corey Nolet cjno...@gmail.com wrote: So this appears to work just fine: hctx.sql(SELECT p.name, p.age FROM people p LATERAL VIEW explode(locations) l AS location JOIN location5 lo ON l.number = lo.streetNumber WHERE location.number = '2300').collect() But as soon as I try to join with another set based on a property from the exploded locations set, I get invalid attribute exceptions: hctx.sql(SELECT p.name, p.age, ln.locationName FROM people as p LATERAL VIEW explode(locations) l AS location JOIN locationNames ln ON location.number = ln.streetNumber WHERE location.number
Re: Accumulo version at runtime?
Dylan, I know your original post mentioned grabbing it through the client API but there's not currently a way to do that. As Sean mentioned, you can do it if you have access to the cluster. You can run the reflection Keith provided by adding the files in $ACCUMULO_HOME/lib/ to your classpath and running your code on the cluster. I definitely think exposing the server version should make it into 1.7. On Fri, Oct 24, 2014 at 6:00 PM, Dylan Hutchison dhutc...@stevens.edu wrote: Keith, I'm confused as to how you would run reflection Constants.class.getDeclaredField(VERSION).get(String.class) on the Accumulo server. Wouldn't we need to compile in a function returning that into the server for a custom Accumulo server build? Say we have a 1.6 Accumulo instance live. A client needs to know the version in order to load the appropriate class. How would you execute that code on the server if it is built from the standard Accumulo 1.6 binaries? On Fri, Oct 24, 2014 at 10:22 AM, Keith Turner ke...@deenlo.com wrote: Below is something I wrote for 1.4 that would grab the version using reflection. If the constant has moved in later versions, could gracefully handle class not found and field not found exceptions and look in other places. https://github.com/keith-turner/instamo/blob/master/src/main/java/instamo/MiniAccumuloCluster.java#L357 On Fri, Oct 24, 2014 at 1:38 AM, Dylan Hutchison dhutc...@stevens.edu wrote: How about a compromise: create *two classes *for the two versions, both implementing the same interface. Instantiate the class for the correct version either from (1) static configuration file or (2) runtime hack lookup to the version on the Monitor. (1) gives safety at the expense of the user having to specify another parameter. (2) looks like it will work at least in the near future going to 1.7, as well as for past versions. Thanks for the suggestions! I like the two classes approach better both as a developer and as a user; no need to juggle JARs. ~Dylan On Fri, Oct 24, 2014 at 12:41 AM, Sean Busbey bus...@cloudera.com wrote: On Thu, Oct 23, 2014 at 10:38 PM, Dylan Hutchison dhutc...@stevens.edu wrote: I'm working on a clean way to handle getting Accumulo monitor info for different versions of Accumulo, since I used methods to extract that information from Accumulo's internals which are version-dependent. As Sean wrote, these are not things one should do, but if it's a choice between getting the info or not... We're thinking of building separate JARs for each 1.x version. Why not just take the version of Accumulo you're going to talk to as configuration information that's given to you as a part of deploying your software? It'll make your life much simpler in the long run. -- Sean -- www.cs.stevens.edu/~dhutchis -- www.cs.stevens.edu/~dhutchis
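A minimal sketch of the reflection lookup described above, assuming the 1.4-era location of the constant (org.apache.accumulo.core.Constants.VERSION) and that the jars from $ACCUMULO_HOME/lib/ are on the classpath; the class name and fallback value are illustrative only:

import java.lang.reflect.Field;

public class AccumuloVersionProbe {
    public static String serverVersion() {
        try {
            // Look the constant up reflectively so this compiles without a fixed Accumulo dependency.
            Class<?> constants = Class.forName("org.apache.accumulo.core.Constants");
            Field version = constants.getDeclaredField("VERSION");
            return (String) version.get(null); // static field, so no instance is needed
        } catch (Exception e) {
            // Class or field not found: the constant moved between releases, or the
            // jars from $ACCUMULO_HOME/lib/ are not on the classpath.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("Accumulo version: " + serverVersion());
    }
}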
Re: Raise Java dependency from 6 to 7
A concrete plan and a definite version upon which the upgrade would be applied sounds like it would benefit the community. If you plan far enough out (as Hadoop has done) and give the community enough of a notice, I can't see it being a problem as they would have ample time to upgrade. On Sat, Oct 18, 2014 at 9:20 PM, Marcelo Vanzin van...@cloudera.com wrote: Hadoop, for better or worse, depends on an ancient version of Jetty (6), that is even on a different package. So Spark (or anyone trying to use a newer Jetty) is lucky on that front... IIRC Hadoop is planning to move to Java 7-only starting with 2.7. Java 7 is also supposed to be EOL some time next year, so a plan to move to Java 7 and, eventually, Java 8 would be nice. On Sat, Oct 18, 2014 at 5:44 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I'd also wait a bit until these are gone. Jetty is unfortunately a much hairier topic by the way, because the Hadoop libraries also depend on Jetty. I think it will be hard to update. However, a patch that shades Jetty might be nice to have, if that doesn't require shading a lot of other stuff. Matei On Oct 18, 2014, at 4:37 PM, Koert Kuipers ko...@tresata.com wrote: my experience is that there are still a lot of java 6 clusters out there. also distros that bundle spark still support java 6 On Oct 17, 2014 8:01 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, I've heard a few times that keeping support for Java 6 is a priority for Apache Spark. Given that Java 6 has been publicly EOL'd since Feb 2013 http://www.oracle.com/technetwork/java/eol-135779.html and the last public update was Apr 2013 https://en.wikipedia.org/wiki/Java_version_history#Java_6_updates, why are we still maintaining support for 6? The only people using it now must be paying for the extended support to continue receiving security fixes. Bumping the lower bound of Java versions up to Java 7 would allow us to upgrade from Jetty 8 to 9, which is currently a conflict with the Dropwizard framework and a personal pain point. Java 6 vs 7 for Spark links: Try with resources https://github.com/apache/spark/pull/2575/files#r18152125 for SparkContext et al Upgrade to Jetty 9 https://github.com/apache/spark/pull/167#issuecomment-54544494 Warn when not compiling with Java6 https://github.com/apache/spark/pull/859 Who are the people out there that still need Java 6 support? Thanks! Andrew -- Marcelo
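For readers unfamiliar with the try-with-resources feature referenced in the links above, a small self-contained Java 7 example (the file name is just a placeholder); the same code on Java 6 needs an explicit try/finally to close the reader:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TryWithResourcesExample {
    public static void main(String[] args) throws IOException {
        // Java 7 try-with-resources: the reader is closed automatically,
        // even when readLine() throws.
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}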
Re: How can I use a time window with the trident api?
I started a project to do sliding and tumbling windows in Storm. It could be used directly or as an example. http://github.com/calrissian/flowmix On Oct 9, 2014 11:54 PM, 姚驰 yaoch...@163.com wrote: Hello, I'm trying to use storm to manipulate our monitoring data, but I don't know how to add a time window support under trident api. Can anyone help, thanks a lot.
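For those who want to stay on the plain Storm API, a tumbling time window can also be approximated with tick tuples. Below is a minimal sketch of that approach, not taken from flowmix; the class name, the 60-second window, and the single-field input are assumptions for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Tumbling window: buffer incoming values and flush the batch whenever a tick tuple arrives.
public class TumblingWindowBolt extends BaseRichBolt {
    private OutputCollector collector;
    private final List<Object> window = new ArrayList<Object>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        boolean isTick = Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
        if (isTick) {
            collector.emit(new Values(new ArrayList<Object>(window))); // emit one full window
            window.clear();
        } else {
            window.add(tuple.getValue(0)); // assumes a single-field input tuple
        }
        collector.ack(tuple);
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60); // one tick, i.e. one window, per minute
        return conf;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("window"));
    }
}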
[ANNOUNCE] Fluo 1.0.0-alpha-1 Released
The Fluo project is happy to announce the 1.0.0-alpha-1 release of Fluo. Fluo is a transaction layer that enables incremental processing on top of Accumulo. It integrates into Yarn using Apache Twill. This is the first release of Fluo and is not ready for production use. We invite developers to try it out, play with the quickstart examples, and contribute back in the form of bug reports, new feature requests, and pull requests. For more information, visit http://www.fluo.io.
Re: C++ accumulo client -- native clients for Python, Go, Ruby etc
I'm all for this- though I'm curious to know the thoughts about maintenance and the design. Are we going to use thrift to tie the C++ client calls into the server-side components? Is that going to be maintained through a separate effort or is the plan to have the Accumulo community officially support it? On Mon, Oct 6, 2014 at 2:34 PM, Josh Elser josh.el...@gmail.com wrote: It'd be really cool to see a C++ client -- fully implemented or not. The increased performance via other languages like you said would be really nice, but I'd also be curious to see how the server characteristics change when the client might be sending data at a much faster rate. My C++ is super rusty these days, but I'd be happy to help out any devs who can spearhead the effort :) John R. Frank wrote: Accumulo Developers, We're trying to boost throughput of non-Java tools with Accumulo. It seems that the lowest hanging fruit is to stop using the thrift proxy. Per discussion about Python and thrift proxy in the users list [1], I'm wondering if anyone is interested in helping with a native C++ client? There is a start on one here [2]. We could offer a bounty or maybe make a consulting project depending who is interested in it. We also looked at trying to run a separate thrift proxy for every worker thread or process. With many cores on a box, eg 32, it just doesn't seem practical to run that many proxies, even if they all run on a single JVM. We'd be glad to hear ideas on that front too. A potentially big benefit of making a proper C++ accumulo client is that it is straightforward to expose native interfaces in Python (via pyObject), Go [3], Ruby [4], and other languages. Thanks for any advice, pointers, interest. John 1-- http://www.mail-archive.com/user@accumulo.apache.org/msg03999.html 2-- https://github.com/phrocker/apeirogon 3-- http://golang.org/cmd/cgo/ 4-- https://www.amberbit.com/blog/2014/6/12/calling-c-cpp-from-ruby/ Sent from +1-617-899-2066
[ANNOUNCE] Apache Accumulo 1.6.1 Released
The Apache Accumulo project is happy to announce its 1.6.1 release. Version 1.6.1 is the most recent bug-fix release in its 1.6.x release line. This version includes numerous bug fixes and performance improvements over previous versions. Existing users of 1.6.x are encouraged to upgrade to this version. Users new to Accumulo are encouraged to start with this version as well. The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage system that features cell-based access control and customizable server-side processing. It is based on Google's BigTable design and is built on top of Apache Hadoop, Apache Zookeeper, and Apache Thrift. The release is available at http://accumulo.apache.org/downloads/ and release notes at http://accumulo.apache.org/release_notes/1.6.1.html. Thanks. - The Apache Accumulo Team
Re: Accumulo Powered By Logo
I think a logo that's more friendly to place in a circle would be useful. The Accumulo logo is very squared off. On Thu, Oct 2, 2014 at 3:39 PM, Mike Drob mad...@cloudera.com wrote: Yea, as an outside observer, I would have no idea what Apache A is, nor any idea how to get more information. Maybe we just need a different logo, altogether, given the context of putting it in the PBA circle. On Thu, Oct 2, 2014 at 2:36 PM, Keith Turner ke...@deenlo.com wrote: I was looking at the new Accumulo powered logo [1] and thought that just an A[2] may be better. Any other thoughts on how to improve this? Someone mentioned that just the A[2] isn't as informative in the case where someone is completely unfamiliar w/ Accumulo. [1]: http://apache.org/foundation/press/kit/poweredBy/pb-accumulo.jpg [2]: http://people.apache.org/~kturner/pb-accumulo.png
Re: [VOTE] Apache Accumulo 1.6.1 RC1
I'm seeing the behavior under Mac OS X and Fedora 19 and they have been consistently failing for me. I'm thinking ACCUMULO-3073. Since others are able to get it to pass, I did not think it should fail the vote solely on that but I do think it needs attention, quickly. On Thu, Sep 25, 2014 at 10:43 AM, Bill Havanki bhava...@clouderagovt.com wrote: I haven't had an opportunity to try it again since my +1, but prior to that it has been consistently failing. - I tried extending the timeout on the test, but it would still time out. - I see the behavior on Mac OS X and under CentOS. (I wonder if it's a JVM thing?) On Wed, Sep 24, 2014 at 9:06 PM, Corey Nolet cjno...@gmail.com wrote: Vote passes with 4 +1's and no -1's. Bill, were you able to get the IT to run yet? I'm still having timeouts on my end as well. On Wed, Sep 24, 2014 at 1:41 PM, Josh Elser josh.el...@gmail.com wrote: The crux of it is that both of the errors in the CRC were single bit variants. y instead of 9 and p instead of 0 Both of these cases are a '1' in the most significant bit of the byte instead of a '0'. We recognized these because y and p are outside of the hex range. Fixing both of these fixes the CRC error (manually verified). That's all we know right now. I'm currently running memtest86. I do not have ECC ram, so it *is* theoretically possible that was the cause. After running memtest for a day or so (or until I need my desktop functional again), I'll go back and see if I can reproduce this again. Mike Drob wrote: Any chance the IRC chats can make it onto the ML for posterity? Mike On Wed, Sep 24, 2014 at 12:04 PM, Keith Turnerke...@deenlo.com wrote: On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks rwe...@newbrightidea.com wrote: Interesting that y (0x79) and 9 (0x39) are one bit away from each other. I blame cosmic rays! It is interesting, and that's only half of the story. It's been interesting chatting w/ Josh about this on irc and hearing about his findings. On Wed, Sep 24, 2014 at 9:05 AM, Josh Elserjosh.el...@gmail.com wrote: The offending keys are: 389a85668b6ebf8e 2ff6:4a78 [] 1411499115242 3a10885b-d481-4d00-be00-0477e231ey65:8576b169: 0cd98965c9ccc1d0:ba15529e The careful eye will notice that the UUID in the first component of the value has a different suffix than the next corrupt key/value (ends with ey65 instead of e965). Fixing this in the Value and re-running the CRC makes it pass. and 7e56b58a0c7df128 5fa0:6249 [] 1411499311578 3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb: 499fa72752d82a7c:5c5f19e8 -- // Bill Havanki // Solutions Architect, Cloudera Govt Solutions // 443.686.9283
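To make the single-bit observation concrete, here is a tiny check of the two corrupted characters discussed in this thread; both differ from the expected hex digits by the same flipped bit (0x40):

public class BitFlipCheck {
    public static void main(String[] args) {
        // 'y' (0x79) vs '9' (0x39) and 'p' (0x70) vs '0' (0x30): each corrupted
        // character differs from the expected hex digit by a single bit.
        System.out.printf("'y' ^ '9' = 0x%02x%n", 'y' ^ '9'); // prints 0x40
        System.out.printf("'p' ^ '0' = 0x%02x%n", 'p' ^ '0'); // prints 0x40
    }
}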
Re: [accumulo] your /dist/ artifacts - 1 BAD signature
I see what happened. I was expecting the mvn:release plugin to push the prepare for next development iteration which it did not. I just pushed it up and created the tag. I'll work on the release notes in a bit. On Thu, Sep 25, 2014 at 3:33 PM, Christopher ctubb...@apache.org wrote: [note: thread moved to dev@] Okay, I just confirmed that the current files in dist are the same ones in Maven Central are the same ones that we voted on. So, that issue is resolved. I double checked and saw that the gpg-signed tag hasn't been created for 1.6.1 (git tag -s 1.6.1 origin/1.6.1-rc1). I guess technically anybody could do this, and merge it (along with the version bump to 1.6.2-SNAPSHOT commit) to 1.6.2-SNAPSHOT branch (and forward, with -sours), if Corey doesn't have time/gets busy. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 25, 2014 at 2:21 PM, Corey Nolet cjno...@gmail.com wrote: There's still a few things I need to do before announcing the release to the user list. Merging the rc into the next version branch was one of them and creating the official release tag was another. I'll do these tonight as well as writing up the release notes for the site. On Thu, Sep 25, 2014 at 1:59 PM, Christopher ctubb...@apache.org wrote: Also, we can move this list to dev@. There's no reason for it to be private@ . -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 25, 2014 at 1:59 PM, Christopher ctubb...@apache.org wrote: There's one more problem that Keith and I found... it doesn't look like the rc1 branch got merged to 1.6.2-SNAPSHOT. I don't know if some other branch got accidentally merged instead. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 25, 2014 at 1:40 PM, Josh Elser josh.el...@gmail.com wrote: Things look good to me now. I checked the artifacts on dist/ against what I have from evaluating the RC and they appear to match. Anything else we need to do here? Christopher wrote: I was able to confirm the signature is bad. When I checked the RC, the signature was good, so I'm guessing the wrong one just got uploaded. I don't have a copy of the RC that I had previously downloaded, but I was able to grab a copy of what was deployed to Maven central and fix the dist sigs/checksums from that. Now, it's possible that the wrong artifacts were uploaded to Maven central (perhaps the wrong staging repo was promoted?) I can't know that for sure, until I can get to work and check my last download from the RC vote and compare with what's in Maven central now. If that is the case, then we need to determine precisely what is different from this upload and what was voted on and see if we need to immediately re-release as 1.6.2 to fix the problems. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 25, 2014 at 3:12 AM, Henk Penninghe...@apache.org wrote: Hi PMC accumulo, I watch 'www.apache.org/dist/', and I noticed that : -- you have 1 BAD pgp signature accumulo/1.6.1/accumulo-1.6.1-src.tar.gz.asc Please fix this problem soon ; for details, see http://people.apache.org/~henkp/checker/sig.html#project-accumulo http://people.apache.org/~henkp/checker/md5.html For information on how to fix problems, see the faq : http://people.apache.org/~henkp/checker/faq.html Thanks a lot, regards, Henk Penning -- apache.org infrastructure PS. The contents of this message is generated, but the mail itself is sent by hand. PS. Please cc me on all relevant emails. - _ Henk P. 
Penning, ICT-beta, R Uithof WISK-412, Faculty of Science, Utrecht University, Budapestlaan 6, 3584CD Utrecht, NL -- T +31 30 253 4106, F +31 30 253 4553 -- http://people.cs.uu.nl/henkp/ -- penn...@uu.nl
Re: [VOTE] Apache Accumulo 1.6.1 RC1
Christopher, are you referring to Keith's last comment or Bill Slacum's? On Thu, Sep 25, 2014 at 9:13 PM, Christopher ctubb...@apache.org wrote: That seems like a reason to vote -1 (and perhaps to encourage others to do so also). I'm not sure this can be helped so long as people have different criteria for their vote, though. If we can fix those issues, I'm ready to vote on a 1.6.2 :) -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 25, 2014 at 2:42 PM, William Slacum wilhelm.von.cl...@accumulo.net wrote: I'm a little concerned we had two +1's that mention failures. The one time when we're supposed to have a clean run through, we have 50% of the participators noticing failure. It doesn't instill much confidence in me. On Thu, Sep 25, 2014 at 2:18 PM, Josh Elser josh.el...@gmail.com wrote: Please make a ticket for it and supply the MAC directories for the test and the failsafe output. It doesn't fail for me. It's possible that there is some edge case that you and Bill are hitting that I'm not. Corey Nolet wrote: I'm seeing the behavior under Max OS X and Fedora 19 and they have been consistently failing for me. I'm thinking ACCUMULO-3073. Since others are able to get it to pass, I did not think it should fail the vote solely on that but I do think it needs attention, quickly. On Thu, Sep 25, 2014 at 10:43 AM, Bill Havanki bhava...@clouderagovt.com wrote: I haven't had an opportunity to try it again since my +1, but prior to that it has been consistently failing. - I tried extending the timeout on the test, but it would still time out. - I see the behavior on Mac OS X and under CentOS. (I wonder if it's a JVM thing?) On Wed, Sep 24, 2014 at 9:06 PM, Corey Noletcjno...@gmail.com wrote: Vote passes with 4 +1's and no -1's. Bill, were you able to get the IT to run yet? I'm still having timeouts on my end as well. On Wed, Sep 24, 2014 at 1:41 PM, Josh Elserjosh.el...@gmail.com wrote: The crux of it is that both of the errors in the CRC where single bit variants. y instead of 9 and p instead of 0 Both of these cases are a '1' in the most significant bit of the byte instead of a '0'. We recognized these because y and p are outside of the hex range. Fixing both of these fixes the CRC error (manually verified). That's all we know right now. I'm currently running memtest86. I do not have ECC ram, so it *is* theoretically possible that was the cause. After running memtest for a day or so (or until I need my desktop functional again), I'll go back and see if I can reproduce this again. Mike Drob wrote: Any chance the IRC chats can make it only the ML for posterity? Mike On Wed, Sep 24, 2014 at 12:04 PM, Keith Turnerke...@deenlo.com wrote: On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks rwe...@newbrightidea.com wrote: Interesting that y (0x79) and 9 (0x39) are one bit away from each other. I blame cosmic rays! It is interesting, and thats only half of the story. Its been interesting chatting w/ Josh about this on irc and hearing about his findings. On Wed, Sep 24, 2014 at 9:05 AM, Josh Elser josh.el...@gmail.com wrote: The offending keys are: 389a85668b6ebf8e 2ff6:4a78 [] 1411499115242 3a10885b-d481-4d00-be00-0477e231ey65:8576b169: 0cd98965c9ccc1d0:ba15529e The careful eye will notice that the UUID in the first component of the value has a different suffix than the next corrupt key/value (ends with ey65 instead of e965). Fixing this in the Value and re-running the CRC makes it pass. 
and 7e56b58a0c7df128 5fa0:6249 [] 1411499311578 3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb: 499fa72752d82a7c:5c5f19e8 -- // Bill Havanki // Solutions Architect, Cloudera Govt Solutions // 443.686.9283
Re: [VOTE] Apache Accumulo 1.6.1 RC1
Bill, I've been having that same IT issue and said the same thing: it's not happening to others. I lifted the timeout completely and it never finished. On Wed, Sep 24, 2014 at 1:13 PM, Mike Drob mad...@cloudera.com wrote: Any chance the IRC chats can make it onto the ML for posterity? Mike On Wed, Sep 24, 2014 at 12:04 PM, Keith Turner ke...@deenlo.com wrote: On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks rwe...@newbrightidea.com wrote: Interesting that y (0x79) and 9 (0x39) are one bit away from each other. I blame cosmic rays! It is interesting, and that's only half of the story. It's been interesting chatting w/ Josh about this on irc and hearing about his findings. On Wed, Sep 24, 2014 at 9:05 AM, Josh Elser josh.el...@gmail.com wrote: The offending keys are: 389a85668b6ebf8e 2ff6:4a78 [] 1411499115242 3a10885b-d481-4d00-be00-0477e231ey65:8576b169: 0cd98965c9ccc1d0:ba15529e The careful eye will notice that the UUID in the first component of the value has a different suffix than the next corrupt key/value (ends with ey65 instead of e965). Fixing this in the Value and re-running the CRC makes it pass. and 7e56b58a0c7df128 5fa0:6249 [] 1411499311578 3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb: 499fa72752d82a7c:5c5f19e8
Re: [DISCUSS] Thinking about branch names
+1 Using separate branches in this manner just adds complexity. I was wondering myself why we needed to create separate branches when all we're doing is tagging/deleting the already released ones. The only difference between where one leaves off and another begins is the name of the branch. On Tue, Sep 23, 2014 at 9:04 AM, Christopher ctubb...@apache.org wrote: +1 to static dev branch names per release series. (this would also fix the Jenkins spam when the builds break due to branch name changes) However, I kind of prefer 1.5.x or 1.5-dev, or similar, over simply 1.5, which looks so much like a release version that I wouldn't want it to generate any confusion. Also, for reference, here's a few git commands that might help some people avoid the situation that happened: git remote update git remote prune $(git remote) git config --global push.default current # git < 1.8 git config --global push.default simple # git >= 1.8 The situation seems to primarily have occurred because of some pushes that succeeded because the local clone was not aware that the remote branches had disappeared. Pruning will clean those up, so that you'll get an error if you try to push. Simple/current push strategy will ensure you don't push all matching branches by default. Josh's proposed solution makes it less likely the branches will disappear/change on a remote, but these are still useful git commands to be aware of, and are related enough to this situation, I thought I'd share. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Mon, Sep 22, 2014 at 11:18 PM, Josh Elser josh.el...@gmail.com wrote: After working on 1.5.2 and today's branch snafu, I think I've come to the conclusion that our branch naming is more pain than it's worth (I believe I was the one who primarily argued for branch names as they are currently implemented, so take that as you want). * Trying to make a new branch for the next version as a release is happening forces you to fight with Maven. Maven expects that your next is going to be on the same branch and the way it makes commits and bumps versions for you encourages this. Using a new branch for next is more manual work for the release manager. * The time after we make a release, there's a bit of confusion (I do it too, just not publicly... yet) about what branch do I put this fix for _version_ in?. It's not uncommon to put it in the old branch instead of the new one. The problem arises when the old branch has already been deleted. If a developer has an old version of that branch, there's nothing to tell them hey, your copy of this branch is behind the remote's copy of this branch. I'm not accepting your push! Having a single branch for a release line removes this hassle. Pictorially, I'm thinking we would change from the active branches {1.5.3-SNAPSHOT, 1.6.1-SNAPSHOT, 1.6.2-SNAPSHOT, master} to {1.5, 1.6, master}. (where a git tag would exist for the 1.6.1 RCs). IIRC, the big argument for per-release branches was encouraging frequent, targeted branches (I know the changes for this version go in this branch). I think most of this can be mitigated by keeping up with frequent releases and coordination with the individual cutting the release. In short, I'm of the opinion that I think we should drop the .z-SNAPSHOT suffix from branch names (e.g. 1.5.3-SNAPSHOT) and move to a shorter x.y (e.g. 1.5) that exists for the lifetime of that version. I think we could also use this approach if/when we change our versioning to start using the x component of x.y.z. Thoughts? - Josh
Re: Apache Storm Graduation to a TLP
Congrats! On Mon, Sep 22, 2014 at 5:16 PM, P. Taylor Goetz ptgo...@gmail.com wrote: I’m pleased to announce that Apache Storm has graduated to a Top-Level Project (TLP), and I’d like to thank everyone in the Storm community for your contributions and help in achieving this important milestone. As part of the graduation process, a number of infrastructure changes have taken place: *New website url:* http://storm.apache.org *New git repo urls:* https://git-wip-us.apache.org/repos/asf/storm.git (for committer push) g...@github.com:apache/storm.git -or- https://github.com/apache/storm.git (for github pull requests) *Mailing Lists:* If you are already subscribed, your subscription has been migrated. New messages should be sent to the new address: [list]@storm.apache.org This includes any subscribe/unsubscribe requests. Note: The mail-archives.apache.org site will not reflect these changes until October 1. Most of these changes have already occurred and are seamless. Please update your git remotes and address books accordingly. - Taylor
Re: [VOTE] Apache Accumulo 1.5.2 RC1
If we are concerned with confusion about adoption of new versions, we should make a point to articulate the purpose very clearly in each of the announcements. I was in the combined camp an hour ago and now I'm also thinking we should keep them separate. On Fri, Sep 19, 2014 at 1:16 AM, Josh Elser josh.el...@gmail.com wrote: No we did not bundle any release announcements prior. I also have to agree with Bill -- I don't really see how there would be confusion with a properly worded announcement. Happy to work with anyone who has concerns in this regard to come up with something that is agreeable. I do think they should be separate. On 9/19/14, 1:02 AM, Mike Drob wrote: Did we bundle 1.5.1/1.6.0? If not, they were fairly close together, I think. Historically, we have not done a great job of distinguishing our release lines, so that has led to confusion. Maybe I'm on the path to talking myself out of a combined announcement here. On Thu, Sep 18, 2014 at 9:57 PM, William Slacum wilhelm.von.cl...@accumulo.net wrote: Not to be a total jerk, but what's unclear about 1.5 1.6? Lots of projects have multiple release lines and it's not an issue. On Fri, Sep 19, 2014 at 12:18 AM, Mike Drob mad...@cloudera.com wrote: +1 to combining. I've already had questions about upgrading to this latest release from somebody currently on the 1.6 line. Our release narrative is not clear and we should not muddle the waters. On Thu, Sep 18, 2014 at 7:27 PM, Christopher ctubb...@apache.org wrote: Should we wait to do a release announcement until 1.6.1, so we can batch the two? My main concern here is that I don't want to encourage new 1.5.x adoption when we have 1.6.x, and having two announcements could be confusing to new users who aren't sure which version to start using. We could issue an announcement that primarily mentions 1.6.1, and also mentions 1.5.2 second. That way, people will see 1.6.x as the stable/focus release, but will still inform 1.5.x users of updates. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 18, 2014 at 10:20 PM, Josh Elser josh.el...@gmail.com wrote: Vote passes with 3 +1's and nothing else. Huge thank you to those who made the time to participate. I'll finish up the rest of the release work tonight. On 9/15/14, 12:24 PM, Josh Elser wrote: Devs, Please consider the following candidate for Apache Accumulo 1.5.2 Tag: 1.5.2rc1 SHA1: 039a2c28bdd474805f34ee33f138b009edda6c4c Staging Repository: https://repository.apache.org/content/repositories/ orgapacheaccumulo-1014/ Source tarball: http://repository.apache.org/content/repositories/ orgapacheaccumulo-1014/org/apache/accumulo/accumulo/1.5. 2/accumulo-1.5.2-src.tar.gz Binary tarball: http://repository.apache.org/content/repositories/ orgapacheaccumulo-1014/org/apache/accumulo/accumulo/1.5. 2/accumulo-1.5.2-bin.tar.gz (Append .sha1, .md5 or .asc to download the signature/hash for a given artifact.) Signing keys available at: https://www.apache.org/dist/accumulo/KEYS Over 1.5.1, we have 109 issues resolved https://git-wip-us.apache.org/repos/asf?p=accumulo.git;a= blob;f=CHANGES;h=c2892d6e9b1c6c9b96b2a58fc901a76363ece8b0;hb= 039a2c28bdd474805f34ee33f138b009edda6c4c Testing: all unit and functional tests are passing and ingested 1B entries using CI w/ agitation over rc0. Vote will be open until Friday, August 19th 12:00AM UTC (8/18 8:00PM ET, 8/18 5:00PM PT) - Josh
Re: AccumuloMultiTableInputFormat IllegalStateException
Have you looked at the libjars? The uber jar is one approach, but it can become very ugly very quickly. http://grepalex.com/2013/02/25/hadoop-libjars/ On Wed, Sep 17, 2014 at 10:59 PM, JavaHokie soozandjohny...@gmail.com wrote: Hi Corey, I am now trying to deploy this @ work and I am unable to get this run without putting accumulo-core, accumulo-fate, accumulo-trace, accumulo-tracer, and accumulo-tserver in the $HADOOP_COMMON_HOME/share/hadoop/common directory. Can you tell me how you package your jar to obviate the need to put these jars here? Thanks --John On Sun, Aug 24, 2014 at 6:50 PM, Corey Nolet-2 [via Apache Accumulo] [hidden email] wrote: Awesome John! It's good to have this documented for future users. Keep us updated! On Sun, Aug 24, 2014 at 11:05 AM, JavaHokie [hidden email] wrote: Hi Corey, Just to wrap things up, AccumuloMultipeTableInputFormat is working really well. This is an outstanding feature I can leverage big-time on my current work assignment, an IRAD I am working on, as well as my own prototype project. Thanks again for your help! --John -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11208.html Sent from the Users mailing list archive at Nabble.com. -- View this message in context: Re: AccumuloMultiTableInputFormat IllegalStateException http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11303.html Sent from the Users mailing list archive http://apache-accumulo.1065345.n5.nabble.com/Users-f2.html at Nabble.com.
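For reference, using -libjars requires the job driver to be launched through ToolRunner so that Hadoop's GenericOptionsParser can pick the option up and ship the listed jars with the job. A minimal sketch, with an illustrative class name, job name, and jar list (the Accumulo input format configuration is elided):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Launch as, e.g.:
//   hadoop jar myjob.jar MyDriver -libjars accumulo-core.jar,accumulo-fate.jar,accumulo-trace.jar <args>
public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "accumulo-multitable-example");
        job.setJarByClass(MyDriver.class);
        // ... configure AccumuloMultiTableInputFormat, mapper, reducer, and output here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner applies GenericOptionsParser, which handles -libjars before run() is called.
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}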
Re: Decouple topology configuration from code
A while ago I had written a camel adapter for storm so that spout inputs could come from camel. Not sure how useful it would be for you but it's located here: https://github.com/calrissian/storm-recipes/blob/master/camel/src/main/java/org/calrissian/recipes/camel/spout/CamelConsumerSpout.java Hi folks, Apache Camel has a number of DSL which allow its topologies (routes wrt. Camel terminology) to be set up and configured easily. I am interested in such approach for Storm. I found java beans usage in: https://github.com/granthenke/storm-spring/ but sounds fairly limited to me. Is there any other DSL like initiative for Storm ? My second concern is storm cluster management: we’d like to have a registry of topologies and be able to register/destroy/launch/suspend/kill/update registered topologies using a REST API. Is there any tool/initiative to support that ? Thx, /DV *Dominique Villard* *Architecte logiciel / Lead Developer* Orange/OF/DTSI/DSI/DFY/SDFY *tél. 04 97 46 30 03* dominique.vill...@orange.com dominique.vill...@orange-ftgroup.com
Re: Decouple topology configuration from code
Also, Trident is a DSL for rapidly producing useful analytics in Storm and I've been working on a DSL that makes streams processing for complex event processing possible. That one is located here: https://github.com/calrissian/flowmix On Sep 16, 2014 4:29 AM, dominique.vill...@orange.com wrote: Hi folks, Apache Camel has a number of DSL which allow its topologies (routes wrt. Camel terminology) to be set up and configured easily. I am interested in such approach for Storm. I found java beans usage in: https://github.com/granthenke/storm-spring/ but sounds fairly limited to me. Is there any other DSL like initiative for Storm ? My second concern is storm cluster management: we’d like to have a registry of topologies and be able to register/destroy/launch/suspend/kill/update registered topologies using a REST API. Is there any tool/initiative to support that ? Thx, /DV *Dominique Villard* *Architecte logiciel / Lead Developer* Orange/OF/DTSI/DSI/DFY/SDFY *tél. 04 97 46 30 03* dominique.vill...@orange.com dominique.vill...@orange-ftgroup.com
Re: Time to release 1.6.1?
I'm on it. I'll get a more formal vote going after I dig through the jira a bit and note what's changed. On Thu, Sep 11, 2014 at 11:06 AM, Christopher ctubb...@apache.org wrote: Also, we can always have a 1.6.2 if there's outstanding bugfixes to release later. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Sep 11, 2014 at 10:36 AM, Eric Newton eric.new...@gmail.com wrote: +1 for 1.6.1. There are people testing a recent 1.6 branch at scale (100s of nodes), with the intent of pushing it to production. I would rather have a released version in production. Thanks for volunteering. Feel free to contact me if you need a hand with anything. -Eric On Wed, Sep 10, 2014 at 1:49 PM, Josh Elser josh.el...@gmail.com wrote: Sure that's fine, Corey. Happy to help coordinate things with you. *Hopefully* it's not too painful :) On 9/10/14, 10:43 AM, Corey Nolet wrote: I had posted this to the mailing list originally after a discussion with Christopher at the Accumulo Summit hack-a-thon and because I wanted to get into the release process to help out. Josh, I still wouldn't mind getting together 1.6.1 if that's okay with you. If nothing else, it would get someone else following the procedures and able to do the release. On Wed, Sep 10, 2014 at 1:22 PM, Josh Elser josh.el...@gmail.com wrote: That's exactly my plan, Christopher. Keith has been the man working on a fix for ACCUMULO-1628 which is what I've been spinning on to get 1.5.2 out the door. I want to spend a little time today looking at his patch to understand the fix and run some tests myself. Hopefully John can retest the patch as well since he had an environment that could reproduce the bug. Right after we get 1.5.2, I'm happy to work on 1.6.1 as well. - Josh On 9/10/14, 10:04 AM, Christopher wrote: Because of ACCUMULO-2988 (upgrade path from 1.4.x -- 1.6.y, y = 1), I'm hoping we can revisit this soon. Maybe get 1.5.2 out the door, followed by 1.6.1 right away. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Fri, Jun 20, 2014 at 10:30 AM, Keith Turner ke...@deenlo.com wrote: On Thu, Jun 19, 2014 at 11:46 AM, Josh Elser josh.el...@gmail.com wrote: I was thinking the same thing, but I also haven't made any strides towards getting 1.5.2 closer to happening (as I said I'd try to do). I still lack physical resources to do the week-long testing as our guidelines currently force us to do. I still think this testing is excessive if we're actually releasing bug-fixes, but it does differentiate us from other communities. I want to run some CI test because of the changes I made w/ walog. I can run the test, but I would like to do that as late as possible. Just let me know when you are thinking of cutting a release. Also, I would like to get 2827 in for the release. I'm really not sure how to approach this which is really why I've been stalling on it. On 6/19/14, 7:18 AM, Mike Drob wrote: I'd like to see 1.5.2 released first, just in case there are issues we discover during that process that need to be addressed. Also, I think it would be useful to resolve the discussion surrounding upgrades[1] before releasing. [1]: http://mail-archives.apache.org/mod_mbox/accumulo-dev/ 201406.mbox/%3CCAGHyZ6LFuwH%3DqGF9JYpitOY9yYDG- sop9g6iq57VFPQRnzmyNQ%40mail.gmail.com%3E On Thu, Jun 19, 2014 at 8:09 AM, Corey Nolet cjno...@gmail.com wrote: I'd like to start getting a candidate together if there are no objections. It looks like we have 65 resolved tickets with a fix version of 1.6.1.
Re: Tablet server thrift issue
As an update, I raised the tablet server memory and I have not seen this error thrown since. I'd like to say raising the memory, alone, was the solution but it appears that I also may be having some performance issues with the switches connecting the racks together. I'll update more as I dive in further. On Fri, Aug 22, 2014 at 11:41 PM, Corey Nolet cjno...@gmail.com wrote: Josh, Your advice is definitely useful- I also thought about catching the exception and retrying with a fresh batch writer but the fact that the batch writer failure doesn't go away without being re-instantiated is really only a nuisance. The TabletServerBatchWriter could be designed much better, I agree, but that is not the root of the problem. The Thrift exception that is causing the issue is what I'd like to get to the bottom of. It's throwing the following: *TApplicationException: applyUpdates failed: out of sequence response * I've never seen this exception before in regular use of the client API- but I also just updated to 1.6.0. Google isn't showing anything useful for how exactly this exception could come about other than using a bad threading model- and I don't see any drastic changes or other user complaints on the mailing list that would validate that line of thought. Quite frankly, I'm stumped. This could be a Thrift exception related to a Thrift bug or something bad on my system and have nothing to do with Accumulo. Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had seen the exception before and may remember what it was/how they fixed it. On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser josh.el...@gmail.com wrote: Don't mean to tell you that I don't think there might be a bug/otherwise, that's pretty much just the limit of what I know about the server-side sessions :) If you have concrete this worked in 1.4.4 and this happens instead with 1.6.0, that'd make a great ticket :D The BatchWriter failure case is pretty rough, actually. Eric has made some changes to help already (in 1.6.1, I think), but it needs an overhaul that I haven't been able to make time to fix properly, either. IIRC, the only guarantee you have is that all mutations added before the last flush() happened are durable on the server. Anything else is a guess. I don't know the specifics, but that should be enough to work with (and saving off mutations shouldn't be too costly since they're stored serialized). On 8/22/14, 5:44 PM, Corey Nolet wrote: Thanks Josh, I understand about the session ID completely but the problem I have is that the exact same client code worked, line for line, just fine in 1.4.4 and it's acting up in 1.6.0. I also seem to remember the BatchWriter automatically creating a new session when one expired without an exception causing it to fail on the client. I know we've made changes since 1.4.4 but I'd like to troubleshoot the actual issue of the BatchWriter failing due to the thrift exception rather than just catching the exception and trying mutations again. The other issue is that I've already submitted a bunch of mutations to the batch writer from different threads. Does that mean I need to be storing them off twice? (once in the BatchWriter's cache and once in my own) The BatchWriter in my ingester is constantly sending data and the tablet servers have been given more than enough memory to be able to keep up. There's no swap being used and the network isn't experiencing any errors. 
On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser josh.el...@gmail.com wrote: If you get an error from a BatchWriter, you pretty much have to throw away that instance of the BatchWriter and make a new one. See ACCUMULO-2990. If you want, you should be able to catch/recover from this without having to restart the ingester. If the session ID is invalid, my guess is that it hasn't been used recently and the tserver cleaned it up. The exception logic isn't the greatest (as it just is presented to you as a RTE). https://issues.apache.org/jira/browse/ACCUMULO-2990 On 8/22/14, 4:35 PM, Corey Nolet wrote: Eric Keith, Chris mentioned to me that you guys have seen this issue before. Any ideas from anyone else are much appreciated as well. I recently updated a project's dependencies to Accumulo 1.6.0 built with Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest component which is running all the time with a batch writer using many threads to push mutations into Accumulo. The issue I'm having is a show stopper. At different intervals of time, sometimes an hour, sometimes 30 minutes, I'm getting MutationsRejectedExceptions (server errors) from the TabletServerBatchWriter. Once they start, I need to restart the ingester to get them to stop. They always come back within 30 minutes to an hour... rinse, repeat. The exception always happens on different tablet servers. It's a thrift error saying a message was received out
Re: AccumuloMultiTableInputFormat IllegalStateException
I'm thinking this could be a yarn.application.classpath configuration problem in your yarn-site.xml. I meant to ask earlier- how are you building your jar that gets deployed? Are you shading it? Using libjars? On Sun, Aug 24, 2014 at 6:56 AM, JavaHokie soozandjohny...@gmail.com wrote: Hey Corey, Yah, sometimes ya just gotta go to the source code. :) It's a weird exception message...I am used to seeing NoClassDefFoundError and ClassNotFoundException. It's also weird that the ReflectionException is not thrown, with NoClassDefFoundError or ClassNotFoundException as the root exception. Anyways, it's a classpath deal, but it's a weird one. I thought maybe I had a 1.5 jar around somewhere, but the fact that the InputConfigurator--also new in Accumulo 1.6--can be found but InputTableConfig cannot is a bit puzzling. But...us Java developers are used to figuring out classpath problems. Currently researching it on my end. Thanks again for all of your help so far--again, much appreciated. Really excited to use this new feature. --John -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11204.html Sent from the Users mailing list archive at Nabble.com.
Re: AccumuloMultiTableInputFormat IllegalStateException
Awesome John! It's good to have this documented for future users. Keep us updated! On Sun, Aug 24, 2014 at 11:05 AM, JavaHokie soozandjohny...@gmail.com wrote: Hi Corey, Just to wrap things up, AccumuloMultipeTableInputFormat is working really well. This is an outstanding feature I can leverage big-time on my current work assignment, an IRAD I am working on, as well as my own prototype project. Thanks again for your help! --John -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11208.html Sent from the Users mailing list archive at Nabble.com.
Re: AccumuloMultiTableInputFormat IllegalStateException
Awesome! I was going to recommend checking out the code last night so that you could put some logging statements in there. You've probably noticed this already but the MapWritable does not have static type parameters so it dumps out the fully qualified class name so that it can instantiate it back using readFields() when it's deserializing. That error is happening when the reflection is occurring- though it doesn't make much sense. The Accumulo mapreduce packages are obviously on the classpath. If you are still having this issue, I'll keep looking more into this as well. On Aug 23, 2014 2:37 PM, JavaHokie soozandjohny...@gmail.com wrote: I checked out 1.6.0 from git and updated the exception handling for the getInputTableConfigs method, rebuilt, and tested my M/R jobs that use Accumulo as a source or sink just to ensure everything is still working correctly. I then updated the InputConfigurator.getInputTableConfig exception handling and I see the root cause is as follows: java.io.IOException: can't find class: org.apache.accumulo.core.client.mapreduce.InputTableConfig because org.apache.accumulo.core.client.mapreduce.InputTableConfig at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212) at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:169) at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.getInputTableConfigs(InputConfigurator.java:563) at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:644) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:342) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:537) at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491) at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286) The IOException can't find class classname because classname is a new one for me, but at least I have something specific to research. --John -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11202.html Sent from the Users mailing list archive at Nabble.com.
Re: Tablet server thrift issue
Thanks Josh, I understand about the session ID completely but the problem I have is that the exact same client code worked, line for line, just fine in 1.4.4 and it's acting up in 1.6.0. I also seem to remember the BatchWriter automatically creating a new session when one expired without an exception causing it to fail on the client. I know we've made changes since 1.4.4 but I'd like to troubleshoot the actual issue of the BatchWriter failing due to the thrift exception rather than just catching the exception and trying mutations again. The other issue is that I've already submitted a bunch of mutations to the batch writer from different threads. Does that mean I need to be storing them off twice? (once in the BatchWriter's cache and once in my own) The BatchWriter in my ingester is constantly sending data and the tablet servers have been given more than enough memory to be able to keep up. There's no swap being used and the network isn't experiencing any errors. On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser josh.el...@gmail.com wrote: If you get an error from a BatchWriter, you pretty much have to throw away that instance of the BatchWriter and make a new one. See ACCUMULO-2990. If you want, you should be able to catch/recover from this without having to restart the ingester. If the session ID is invalid, my guess is that it hasn't been used recently and the tserver cleaned it up. The exception logic isn't the greatest (as it just is presented to you as a RTE). https://issues.apache.org/jira/browse/ACCUMULO-2990 On 8/22/14, 4:35 PM, Corey Nolet wrote: Eric Keith, Chris mentioned to me that you guys have seen this issue before. Any ideas from anyone else are much appreciated as well. I recently updated a project's dependencies to Accumulo 1.6.0 built with Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest component which is running all the time with a batch writer using many threads to push mutations into Accumulo. The issue I'm having is a show stopper. At different intervals of time, sometimes an hour, sometimes 30 minutes, I'm getting MutationsRejectedExceptions (server errors) from the TabletServerBatchWriter. Once they start, I need to restart the ingester to get them to stop. They always come back within 30 minutes to an hour... rinse, repeat. The exception always happens on different tablet servers. It's a thrift error saying a message was received out of sequence. In the TabletServer logs, I see an Invalid session id exception which happens only once before the client-side batch writer starts spitting out the MREs. I'm running some heavyweight processing in Storm along side the tablet servers. I shut that processing off in hopes that maybe it was the culprit but that hasn't fixed the issue. I'm surprised I haven't seen any other posts on the topic. Thanks!
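For anyone hitting the same thing, a minimal sketch of the catch-and-recreate approach described above; the table name, writer config, and single-retry policy are illustrative, and what needs to be re-added depends on where your last successful flush() was:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;

public class RetryingWriter {
    // Writes a mutation; if the BatchWriter fails it is discarded and rebuilt,
    // since a failed writer cannot be reused (see ACCUMULO-2990).
    public static void writeWithRetry(Connector connector, Mutation mutation)
            throws TableNotFoundException, MutationsRejectedException {
        BatchWriterConfig cfg = new BatchWriterConfig().setMaxWriteThreads(8);
        BatchWriter writer = connector.createBatchWriter("mytable", cfg);
        try {
            writer.addMutation(mutation);
            writer.flush();
        } catch (MutationsRejectedException e) {
            try { writer.close(); } catch (MutationsRejectedException ignored) { }
            // Recreate the writer, re-add anything not covered by the last successful flush, retry once.
            writer = connector.createBatchWriter("mytable", cfg);
            writer.addMutation(mutation);
            writer.flush();
        } finally {
            writer.close();
        }
    }
}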
Re: Tablet server thrift issue
Josh, Your advice is definitely useful- I also thought about catching the exception and retrying with a fresh batch writer but the fact that the batch writer failure doesn't go away without being re-instantiated is really only a nuisance. The TabletServerBatchWriter could be designed much better, I agree, but that is not the root of the problem. The Thrift exception that is causing the issue is what I'd like to get to the bottom of. It's throwing the following: *TApplicationException: applyUpdates failed: out of sequence response * I've never seen this exception before in regular use of the client API- but I also just updated to 1.6.0. Google isn't showing anything useful for how exactly this exception could come about other than using a bad threading model- and I don't see any drastic changes or other user complaints on the mailing list that would validate that line of thought. Quite frankly, I'm stumped. This could be a Thrift exception related to a Thrift bug or something bad on my system and have nothing to do with Accumulo. Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had seen the exception before and may remember what it was/how they fixed it. On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser josh.el...@gmail.com wrote: Don't mean to tell you that I don't think there might be a bug/otherwise, that's pretty much just the limit of what I know about the server-side sessions :) If you have concrete this worked in 1.4.4 and this happens instead with 1.6.0, that'd make a great ticket :D The BatchWriter failure case is pretty rough, actually. Eric has made some changes to help already (in 1.6.1, I think), but it needs an overhaul that I haven't been able to make time to fix properly, either. IIRC, the only guarantee you have is that all mutations added before the last flush() happened are durable on the server. Anything else is a guess. I don't know the specifics, but that should be enough to work with (and saving off mutations shouldn't be too costly since they're stored serialized). On 8/22/14, 5:44 PM, Corey Nolet wrote: Thanks Josh, I understand about the session ID completely but the problem I have is that the exact same client code worked, line for line, just fine in 1.4.4 and it's acting up in 1.6.0. I also seem to remember the BatchWriter automatically creating a new session when one expired without an exception causing it to fail on the client. I know we've made changes since 1.4.4 but I'd like to troubleshoot the actual issue of the BatchWriter failing due to the thrift exception rather than just catching the exception and trying mutations again. The other issue is that I've already submitted a bunch of mutations to the batch writer from different threads. Does that mean I need to be storing them off twice? (once in the BatchWriter's cache and once in my own) The BatchWriter in my ingester is constantly sending data and the tablet servers have been given more than enough memory to be able to keep up. There's no swap being used and the network isn't experiencing any errors. On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser josh.el...@gmail.com wrote: If you get an error from a BatchWriter, you pretty much have to throw away that instance of the BatchWriter and make a new one. See ACCUMULO-2990. If you want, you should be able to catch/recover from this without having to restart the ingester. If the session ID is invalid, my guess is that it hasn't been used recently and the tserver cleaned it up. 
The exception logic isn't the greatest (as it is just presented to you as an RTE). https://issues.apache.org/jira/browse/ACCUMULO-2990 On 8/22/14, 4:35 PM, Corey Nolet wrote: Eric, Keith, Chris mentioned to me that you guys have seen this issue before. Any ideas from anyone else are much appreciated as well. I recently updated a project's dependencies to Accumulo 1.6.0 built with Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest component which is running all the time with a batch writer using many threads to push mutations into Accumulo. The issue I'm having is a show stopper. At different intervals of time, sometimes an hour, sometimes 30 minutes, I'm getting MutationsRejectedExceptions (server errors) from the TabletServerBatchWriter. Once they start, I need to restart the ingester to get them to stop. They always come back within 30 minutes to an hour... rinse, repeat. The exception always happens on different tablet servers. It's a thrift error saying a message was received out of sequence. In the TabletServer logs, I see an Invalid session id exception which happens only once before the client-side batch writer starts spitting out the MREs. I'm running some heavyweight processing in Storm alongside the tablet servers. I shut that processing off in hopes that maybe it was the culprit but that hasn't fixed the issue. I'm surprised I haven't seen any other posts on the topic.
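Following Josh's suggestion, the recovery pattern would look roughly like the sketch below: keep a client-side list of mutations added since the last successful flush, and when the writer fails, close it, create a fresh one, and replay that list. This is only a sketch under the assumption that a single synchronized wrapper is shared by the ingest threads; the class name, buffer, and error handling are hypothetical, not code from the ingester discussed in this thread.

import java.util.ArrayList;
import java.util.List;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;

public class ResilientWriter {
  private final Connector connector;
  private final String table;
  private BatchWriter writer;
  // Mutations added since the last successful flush(); only these need replaying.
  private final List<Mutation> pending = new ArrayList<Mutation>();

  public ResilientWriter(Connector connector, String table) throws Exception {
    this.connector = connector;
    this.table = table;
    this.writer = connector.createBatchWriter(table, new BatchWriterConfig());
  }

  public synchronized void add(Mutation m) throws Exception {
    pending.add(m);
    writer.addMutation(m);
  }

  public synchronized void flush() throws Exception {
    try {
      writer.flush();
      pending.clear(); // everything added before this flush is now durable
    } catch (MutationsRejectedException e) {
      // A failed BatchWriter cannot be reused; close it, make a new one,
      // and replay the mutations that were never confirmed durable.
      // (A fuller version would apply the same recovery when addMutation itself fails.)
      try { writer.close(); } catch (MutationsRejectedException ignored) { }
      writer = connector.createBatchWriter(table, new BatchWriterConfig());
      for (Mutation m : pending) {
        writer.addMutation(m);
      }
      writer.flush();
      pending.clear();
    }
  }
}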
Re: AccumuloMultiTableInputFormat IllegalStateException
Hey John, Could you give an example of one of the ranges you are using which causes this to happen? On Fri, Aug 22, 2014 at 11:02 PM, John Yost soozandjohny...@gmail.com wrote: Hey Everyone, The AccumuloMultiTableInputFormat is an awesome addition to the Accumulo API and I am really excited to start using it. My first attempt with the 1.6.0 release resulted in this IllegalStateException: java.lang.IllegalStateException: The table query configurations could not be deserialized from the given configuration at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.getInputTableConfigs(InputConfigurator.java:566) at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:628) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:342) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:537) at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286) at com.johnyostanalytics.mapreduce.client.TwitterJoin.run(TwitterJoin.java:104) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) when I attempt to initialize the AccumuloMultiTableInputFormat: InputTableConfig baseConfig = new InputTableConfig(); baseConfig.setRanges(ranges); InputTableConfig edgeConfig = new InputTableConfig(); edgeConfig.setRanges(ranges); configs.put("base", baseConfig); configs.put("edges", edgeConfig); AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs); Any ideas as to what may be going on? I know that the table names are valid and that the Range objects are valid because I tested all of that independently via Accumulo scans. Any guidance is greatly appreciated because, again, AccumuloMultiTableInputFormat is really cool and I am really looking forward to using it. Thanks --John
Re: AccumuloMultiTableInputFormat IllegalStateException
The table configs get serialized as base64 and placed in the job's Configuration under the key AccumuloInputFormat.ScanOpts.TableConfigs. Could you verify/print what's being placed in this key in your configuration? On Sat, Aug 23, 2014 at 12:15 AM, JavaHokie soozandjohny...@gmail.com wrote: Hey Corey, Sure thing! Here is my code: Map<String,InputTableConfig> configs = new HashMap<String,InputTableConfig>(); List<Range> ranges = Lists.newArrayList(new Range("104587"), new Range("105255")); InputTableConfig edgeConfig = new InputTableConfig(); edgeConfig.setRanges(ranges); InputTableConfig followerConfig = new InputTableConfig(); followerConfig.setRanges(ranges); configs.put("following", followerConfig); configs.put("twitteredges", edgeConfig); These are the row values I am using to join entries from the following and twitteredges tables. --John -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11189.html Sent from the Users mailing list archive at Nabble.com.
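One way to do the verification Corey asks for is to iterate over the job's Configuration and print any property whose key mentions the serialized table configs; the exact property name is assembled internally by Accumulo's InputConfigurator, so matching on a substring avoids guessing it. A rough sketch; the helper class name is made up.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfigDump {
  // Print every property whose key mentions the serialized table configs, so we
  // can confirm the base64 value is present in the configuration the job will use.
  public static void dumpTableConfigKeys(Job job) {
    Configuration conf = job.getConfiguration(); // the job's own copy, not the Configuration passed in
    for (Map.Entry<String, String> entry : conf) {
      if (entry.getKey().contains("TableConfigs")) {
        System.out.println(entry.getKey() + " = " + entry.getValue());
      }
    }
  }
}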
Re: AccumuloMultiTableInputFormat IllegalStateException
The tests I'm running aren't using the native Hadoop libs either. If you don't mind, a little more code as to how you are setting up your job would be useful. That's weird that the key in the config would be null. Are you using job.getConfiguration()? On Sat, Aug 23, 2014 at 12:31 AM, JavaHokie soozandjohny...@gmail.com wrote: Hey Corey, Gotcha, I get a null when I attempt to log the value: log.debug(configuration.get("AccumuloInputFormat.ScanOpts.TableConfigs")); --John -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11191.html Sent from the Users mailing list archive at Nabble.com.
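The null is consistent with reading the wrong Configuration object: Job.getInstance(conf) takes a copy, so properties written through the job are not visible in the original conf. A tiny illustration of that copy behavior, using a made-up key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CopySemanticsDemo {
  public static void main(String[] args) throws Exception {
    Configuration original = new Configuration();
    Job job = Job.getInstance(original); // the Job takes a private copy of 'original'

    job.getConfiguration().set("example.key", "example.value");

    // The job's copy sees the value; the Configuration we started from does not.
    System.out.println(job.getConfiguration().get("example.key")); // example.value
    System.out.println(original.get("example.key"));               // null
  }
}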
Re: AccumuloMultiTableInputFormat IllegalStateException
Also, if you don't mind me asking, why isn't your job setup class extending Configured? That way you are picking up configurations injected from the environment. You would do MyJobSetUpClass extends Configured and then use getConf() instead of newing up a new Configuration. On Sat, Aug 23, 2014 at 1:11 AM, Corey Nolet cjno...@gmail.com wrote: Job.getInstance(configuration) copies the configuration and makes its own. Try doing your debug statement from earlier on job.getConfiguration() and let's see what the base64 string looks like. On Sat, Aug 23, 2014 at 1:00 AM, JavaHokie soozandjohny...@gmail.com wrote: Sure thing, here's my run method implementation: Configuration configuration = new Configuration(); configuration.set("fs.defaultFS", "hdfs://127.0.0.1:8020"); configuration.set("mapreduce.job.tracker", "localhost:54311"); configuration.set("mapreduce.framework.name", "yarn"); configuration.set("yarn.resourcemanager.address", "localhost:8032"); Job job = Job.getInstance(configuration); /* * Set the basic stuff */ job.setJobName("TwitterJoin Query"); job.setJarByClass(TwitterJoin.class); /* * Set Mapper and Reducer Classes */ job.setMapperClass(TwitterJoinMapper.class); job.setReducerClass(TwitterJoinReducer.class); /* * Set the Mapper MapOutputKeyClass and MapOutputValueClass */ job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); /* * Set the Reducer OutputKeyClass and OutputValueClass */ job.setOutputKeyClass(Text.class); job.setOutputValueClass(Mutation.class); /* * Set InputFormat and OutputFormat classes */ job.setInputFormatClass(AccumuloMultiTableInputFormat.class); job.setOutputFormatClass(AccumuloOutputFormat.class); /* * Configure InputFormat and OutputFormat Classes */ Map<String,InputTableConfig> configs = new HashMap<String,InputTableConfig>(); List<Range> ranges = Lists.newArrayList(new Range("104587"), new Range("105255")); InputTableConfig edgeConfig = new InputTableConfig(); edgeConfig.setRanges(ranges); edgeConfig.setAutoAdjustRanges(true); InputTableConfig followerConfig = new InputTableConfig(); followerConfig.setRanges(ranges); followerConfig.setAutoAdjustRanges(true); configs.put("following", followerConfig); configs.put("twitteredges", edgeConfig); AccumuloMultiTableInputFormat.setConnectorInfo(job, "root", new PasswordToken("".getBytes())); AccumuloMultiTableInputFormat.setZooKeeperInstance(job, "localhost", "localhost"); AccumuloMultiTableInputFormat.setScanAuthorizations(job, new Authorizations("private")); AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs); AccumuloOutputFormat.setZooKeeperInstance(job, "localhost", "localhost"); AccumuloOutputFormat.setConnectorInfo(job, "root", new PasswordToken("".getBytes())); AccumuloOutputFormat.setCreateTables(job, true); AccumuloOutputFormat.setDefaultTableName(job, "twitteredgerollup"); /* * Kick off the job, wait for completion, and return applicable code */ boolean success = job.waitForCompletion(true); if (success) { return 0; } return 1; } -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11193.html Sent from the Users mailing list archive at Nabble.com.
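A minimal sketch of the Configured/Tool pattern Corey suggests; the class name just mirrors John's job and the body is abridged. ToolRunner supplies the Configuration from the environment plus any -D/-conf arguments, so getConf() replaces the hand-built Configuration.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TwitterJoin extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration ToolRunner built from the environment
    // and command-line overrides, so nothing needs to be hard-coded here.
    Job job = Job.getInstance(getConf(), "TwitterJoin Query");
    job.setJarByClass(TwitterJoin.class);
    // ... mapper/reducer, input/output formats, and the Accumulo setup as before ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new TwitterJoin(), args));
  }
}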
Re: AccumuloMultiTableInputFormat IllegalStateException
That code I posted should be able to validate where you are getting hung up. Can you try running that on the machine and seeing if it prints the expected tables/ranges? Also, are you running the job live? What does the configuration look like for the job on your resource manager? Can you see if the base64 matches? On Sat, Aug 23, 2014 at 1:47 AM, JavaHokie soozandjohny...@gmail.com wrote: H...the byte[] array is generated OK. byte[] bytes = Base64.decodeBase64(configString.getBytes(StandardCharsets.UTF_8)); I wonder what's going wrong with one of these lines below? ByteArrayInputStream bais = new ByteArrayInputStream(bytes); mapWritable.readFields(new DataInputStream(bais)); -- View this message in context: http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11199.html Sent from the Users mailing list archive at Nabble.com.
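If the byte[] decodes but readFields still fails, a standalone round trip of a MapWritable through the same base64 path can confirm that the decode/readFields pair is sound and narrow the problem to what was actually stored. This is an illustrative sketch with made-up keys, not Accumulo's internal serialization code.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class RoundTrip {
  public static void main(String[] args) throws Exception {
    // Serialize a MapWritable the same way the snippet above deserializes one:
    // writable -> bytes -> base64 string -> bytes -> writable.
    MapWritable out = new MapWritable();
    out.put(new Text("table"), new Text("value"));

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    out.write(new DataOutputStream(baos));
    String configString = new String(Base64.encodeBase64(baos.toByteArray()), StandardCharsets.UTF_8);

    byte[] bytes = Base64.decodeBase64(configString.getBytes(StandardCharsets.UTF_8));
    MapWritable in = new MapWritable();
    in.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));

    System.out.println(in.get(new Text("table"))); // prints "value" if the round trip is intact
  }
}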
Re: Kafka + Storm
Kafka is also distributed in nature, which is not something easily achieved by queuing brokers like ActiveMQ or JMS (1.0) in general. Kafka allows data to be partitioned across many machines which can grow as necessary as your data grows. On Thu, Aug 14, 2014 at 11:20 PM, Justin Workman justinjwork...@gmail.com wrote: Absolutely! Sent from my iPhone On Aug 14, 2014, at 9:02 PM, anand nalya anand.na...@gmail.com wrote: I agree, not for the long run but for small bursts in data production rate, say peak hours, Kafka can help in providing a somewhat consistent load on Storm cluster. -- From: Justin Workman justinjwork...@gmail.com Sent: 15-08-2014 07:53 To: user@storm.incubator.apache.org Subject: Re: Kafka + Storm I suppose not directly. It depends on the lifetime of your Kafka queues and on your latency requirements. You need to make sure you have enough doctors, or in storm language workers, in your storm cluster to process your messages within your SLA. For our case, we have a 3 hour lifetime or ttl configured for our queues. Meaning records in the queue older than 3 hours are purged. We also have an internal SLA (team goal, not published to the business ;)) of 10 seconds from event to end of stream and available for end user consumption. So we need to make sure we have enough storm workers to meet: 1) the normal SLA and 2) be able to catch up on the queues when we have to take storm down for maintenance and such and the queues build. There are many knobs you can tune for both storm and Kafka. We have spent many hours tuning things to meet our SLAs. Justin Sent from my iPhone On Aug 14, 2014, at 8:05 PM, anand nalya anand.na...@gmail.com wrote: Also, since Kafka acts as a buffer, storm is not directly affected by the speed of your data sources/producers. -- From: Justin Workman justinjwork...@gmail.com Sent: 15-08-2014 07:12 To: user@storm.incubator.apache.org Subject: Re: Kafka + Storm Good analogy! Sent from my iPhone On Aug 14, 2014, at 7:36 PM, Adaryl "Bob" Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Ah so Storm is the hospital and Kafka is the waiting room where everybody queues up to be seen in turn yes? Adaryl Bob Wakefield, MBA Principal Mass Street Analytics 913.938.6685 www.linkedin.com/in/bobwakefieldmba Twitter: @BobLovesData *From:* Justin Workman justinjwork...@gmail.com *Sent:* Thursday, August 14, 2014 7:47 PM *To:* user@storm.incubator.apache.org *Subject:* Re: Kafka + Storm If you are familiar with Weblogic or ActiveMQ, it is similar. Let's see if I can explain, I am definitely not a subject matter expert on this. Within Kafka you can create queues, i.e. a webclicks queue. Your web servers can then send click events to this queue in Kafka. The web servers, or agent writing the events to this queue, are referred to as the producer. Each event, or message in Kafka is assigned an id. On the other side there are consumers, in storm's case this would be the storm Kafka spout, that can subscribe to this webclicks queue to consume the messages that are in the queue. The consumer can consume a single message from the queue, or a batch of messages, as storm does. The consumer keeps track of the latest offset, Kafka message id, that it has consumed. This way the next time the consumer checks to see if there are more messages to consume it will ask for messages with a message id greater than its last offset.
This helps with the reliability of the event stream and helps guarantee that your events/messages make it start to finish through your stream, assuming the events get to Kafka ;) Hope this helps and makes some sort of sense. Again, sent from my iPhone ;) Justin Sent from my iPhone On Aug 14, 2014, at 6:28 PM, Adaryl "Bob" Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: I get your reasoning at a high level. I should have specified that I wasn’t sure what Kafka does. I don’t have a hard software engineering background. I know that Kafka is “a message queuing” system, but I don’t really know what that means. (I can’t believe you wrote all that from your iPhone) B. *From:* Justin Workman justinjwork...@gmail.com *Sent:* Thursday, August 14, 2014 7:22 PM *To:* user@storm.incubator.apache.org *Subject:* Re: Kafka + Storm Personally, we looked at several options, including writing our own storm source. There are limited storm sources with community support out there. For us, it boiled down to the following: 1) community support and what appeared to be a standard method. Storm has now included the kafka source as a bundled component to storm. This made the implementation much faster, because the code was done. 2) the durability (replication and clustering) of Kafka. We have a three hour retention period on our queues, so if we need to do maintenance on storm or deploy an updated topology,
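For readers less familiar with Kafka, the producer half of Justin's webclicks example might look roughly like the sketch below, written against the newer Java producer client; the broker address, topic name, key, and payload are placeholders, and the setups described in this thread may well have used the older Scala producer API instead.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);
    // Each send appends the event to a partition of the "webclicks" topic; Kafka
    // assigns it an offset, and consumers (such as the storm-kafka spout) read
    // from their last committed offset forward.
    producer.send(new ProducerRecord<>("webclicks", "user-42", "{\"page\":\"/home\"}"));
    producer.close();
  }
}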
Re: Good way to test when topology in local cluster is fully active
Vincent, P. Taylor, I played with the testing framework for a little bit last night and don't see any easy way to provide pauses in between the emissions of the mock tuples. For instance, my sliding window semantics are heavily orchestrated by time evictions and triggers which means that I need to be able to time the tuples being fed into the tests (i.e. emit a tuple every 500ms and run the test for 25 seconds). On Tue, Aug 5, 2014 at 2:23 PM, P. Taylor Goetz ptgo...@gmail.com wrote: My guess is that the slowdown you are seeing is a result of the new version of ZooKeeper and how it handles IPv4/6. Try adding the following JVM parameter when running your tests: -Djava.net.preferIPv4Stack=true -Taylor On Aug 4, 2014, at 8:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm testing some sliding window algorithms with tuples emitted from a mock spout based on a timer but the amount of time it takes the topology to fully start up and activate seems to vary from computer to computer. Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my tests are breaking because the time to activate the topology is taking longer (because of Netty possibly?). I'd like to make my tests more resilient to things like this. Is there something I can look at in LocalCluster where I could do while(!notActive) { Thread.sleep(50) } ? This is what my test looks like currently: StormTopology topology = buildTopology(...); Config conf = new Config(); conf.setNumWorkers(1); LocalCluster cluster = new LocalCluster(); cluster.submitTopology(getTopologyName(), conf, topology); try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } cluster.shutdown(); assertEquals(4, MockSinkBolt.getEvents().size()); Thanks!
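One low-tech way to make the test above resilient to slow start-up, without the simulated-time framework, is to replace the fixed Thread.sleep(4000) with a poll on the mock sink bounded by a deadline. A sketch that slots into the test quoted above; the 30-second bound is arbitrary and MockSinkBolt is the sink from that test.

// Drop-in replacement for the fixed sleep in the test above: wait until the
// mock sink has seen the expected number of events, or give up after a deadline.
private static void awaitEvents(int expected, long timeoutMs) throws InterruptedException {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (MockSinkBolt.getEvents().size() < expected && System.currentTimeMillis() < deadline) {
    Thread.sleep(50);
  }
}

// ...then in the test, instead of Thread.sleep(4000):
// awaitEvents(4, 30000);
// cluster.shutdown();
// assertEquals(4, MockSinkBolt.getEvents().size());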
Re: Good way to test when topology in local cluster is fully active
Sorry- the ipv4 fix worked. On Tue, Aug 5, 2014 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote: This did work. Thanks! On Tue, Aug 5, 2014 at 2:23 PM, P. Taylor Goetz ptgo...@gmail.com wrote: My guess is that the slowdown you are seeing is a result of the new version of ZooKeeper and how it handles IPv4/6. Try adding the following JVM parameter when running your tests: -Djava.net.preferIPv4Stack=true -Taylor On Aug 4, 2014, at 8:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm testing some sliding window algorithms with tuples emitted from a mock spout based on a timer but the amount of time it takes the topology to fully start up and activate seems to vary from computer to computer. Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my tests are breaking because the time to activate the topology is taking longer (because of Netty possibly?). I'd like to make my tests more resilient to things like this. Is there something I can look at in LocalCluster where I could do while(!notActive) { Thread.sleep(50) } ? This is what my test looks like currently: StormTopology topology = buildTopology(...); Config conf = new Config(); conf.setNumWorkers(1); LocalCluster cluster = new LocalCluster(); cluster.submitTopology(getTopologyName(), conf, topology); try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } cluster.shutdown(); assertEquals(4, MockSinkBolt.getEvents().size()); Thanks!
Re: Good way to test when topology in local cluster is fully active
This did work. Thanks! On Tue, Aug 5, 2014 at 2:23 PM, P. Taylor Goetz ptgo...@gmail.com wrote: My guess is that the slowdown you are seeing is a result of the new version of ZooKeeper and how it handles IPv4/6. Try adding the following JVM parameter when running your tests: -Djava.net.preferIPv4Stack=true -Taylor On Aug 4, 2014, at 8:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm testing some sliding window algorithms with tuples emitted from a mock spout based on a timer but the amount of time it takes the topology to fully start up and activate seems to vary from computer to computer. Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my tests are breaking because the time to activate the topology is taking longer (because of Netty possibly?). I'd like to make my tests more resilient to things like this. Is there something I can look at in LocalCluster where I could do while(!notActive) { Thread.sleep(50) } ? This is what my test looks like currently: StormTopology topology = buildTopology(...); Config conf = new Config(); conf.setNumWorkers(1); LocalCluster cluster = new LocalCluster(); cluster.submitTopology(getTopologyName(), conf, topology); try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } cluster.shutdown(); assertEquals(4, MockSinkBolt.getEvents().size()); Thanks!
Re: Good way to test when topology in local cluster is fully active
Oh Nice. Is this new in 0.9.*? I just updated so I haven't looked much into what's changed yet, other than Netty. On Mon, Aug 4, 2014 at 10:40 PM, Vincent Russell vincent.russ...@gmail.com wrote: Corey, Have you tried using the integration testing framework that comes with storm? Testing.withSimulatedTimeLocalCluster(mkClusterParam, new TestJob() { @Override public void run(ILocalCluster cluster) throws Exception { CompleteTopologyParam completeTopologyParam = new CompleteTopologyParam(); completeTopologyParam .setMockedSources(mockedSources); completeTopologyParam.setStormConf(daemonConf); completeTopologyParam.setTopologyName(getTopologyName()); Map result = Testing.completeTopology(cluster, topology, completeTopologyParam); }); -Vincent On Mon, Aug 4, 2014 at 8:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm testing some sliding window algorithms with tuples emitted from a mock spout based on a timer but the amount of time it takes the topology to fully start up and activate seems to vary from computer to computer. Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my tests are breaking because the time to activate the topology is taking longer (because of Netty possibly?). I'd like to make my tests more resilient to things like this. Is there something I can look at in LocalCluster where I could do while(!notActive) { Thread.sleep(50) } ? This is what my test looks like currently: StormTopology topology = buildTopology(...); Config conf = new Config(); conf.setNumWorkers(1); LocalCluster cluster = new LocalCluster(); cluster.submitTopology(getTopologyName(), conf, topology); try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } cluster.shutdown(); assertEquals(4, MockSinkBolt.getEvents().size()); Thanks!
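Filling out Vincent's snippet a little, a complete-topology test in 0.9.x would look roughly like the following. The spout id, mock tuples, and buildTopology() stub are placeholders, and, as Corey notes above, this pattern still doesn't directly support emitting a tuple every 500ms of wall-clock time.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.ILocalCluster;
import backtype.storm.Testing;
import backtype.storm.generated.StormTopology;
import backtype.storm.testing.CompleteTopologyParam;
import backtype.storm.testing.MkClusterParam;
import backtype.storm.testing.MockedSources;
import backtype.storm.testing.TestJob;
import backtype.storm.tuple.Values;

public class WindowTopologyTest {

  public void testTopology() {
    MkClusterParam mkClusterParam = new MkClusterParam();
    mkClusterParam.setSupervisors(1);

    Testing.withSimulatedTimeLocalCluster(mkClusterParam, new TestJob() {
      @Override
      public void run(ILocalCluster cluster) throws Exception {
        StormTopology topology = buildTopology(); // hypothetical builder, as in the test above

        // Feed fixed tuples into the spout instead of relying on wall-clock timing.
        MockedSources mockedSources = new MockedSources();
        mockedSources.addMockData("event-spout", new Values("e1"), new Values("e2"));

        CompleteTopologyParam completeTopologyParam = new CompleteTopologyParam();
        completeTopologyParam.setMockedSources(mockedSources);
        completeTopologyParam.setStormConf(new Config());
        completeTopologyParam.setTopologyName("window-test");

        // completeTopology runs until the mocked tuples are fully processed,
        // so there is no need to guess how long start-up takes.
        Map result = Testing.completeTopology(cluster, topology, completeTopologyParam);
        System.out.println(Testing.readTuples(result, "event-spout"));
      }
    });
  }

  private StormTopology buildTopology() {
    throw new UnsupportedOperationException("topology wiring omitted");
  }
}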