Strange DAG scheduling behavior on concurrently dependent RDDs

2015-01-07 Thread Corey Nolet
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
that we've been developing that connects various different RDDs together
based on some predefined business cases. After updating to 1.2.0, some of
the concurrency expectations about how the stages within jobs are executed
have changed quite significantly.

Given 3 RDDs:

RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
RDD2 = RDD1.outputToFile
RDD3 = RDD1.groupBy().outputToFile

In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage
encountered (RDD2's outputToFile or RDD3's groupBy()) and then for RDD2 and
RDD3 to both block waiting for RDD1 to complete and cache- at which point
RDD2 and RDD3 both use the cached version to complete their work.

Spark 1.2.0 seems to schedule two (albeit concurrently running) stages for
each of RDD1's stages (inputFromHDFS, groupBy(), sortBy(), etc() will each
get run twice). It does not look like there is any sharing of the results
between these jobs.

Are we doing something wrong? Is there a setting that I'm not understanding
somewhere?
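
For reference, a minimal sketch of the pattern in question using Spark's Java
API (the paths and transformations are placeholders, not our framework code):
cache the shared parent once, then submit the two downstream actions from
separate threads so both are expected to reuse the cached partitions instead
of recomputing the lineage.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedLineageSketch {
  public static void main(String[] args) throws Exception {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("shared-lineage"));

    // Shared, cached parent (stands in for inputFromHDFS().groupBy().sortBy().etc())
    JavaPairRDD<String, Iterable<String>> rdd1 =
        sc.textFile("hdfs:///data/input")                 // placeholder path
          .groupBy(line -> line.split(",")[0])
          .cache();

    // Two downstream actions submitted from separate threads; the expectation
    // under discussion is that both reuse rdd1's cached partitions.
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> rdd1.saveAsTextFile("hdfs:///out/rdd2"));
    pool.submit(() -> rdd1.mapValues(vals -> vals.toString()).saveAsTextFile("hdfs:///out/rdd3"));
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    sc.stop();
  }
}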


Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
Sorry- replace ### with an actual number. What does a skipped stage mean?
I'm running a series of jobs and it seems like after a certain point, the
number of skipped stages is larger than the number of actual completed
stages.

On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looks like the number of skipped stages couldn't be formatted.

 Cheers

 On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

 We just upgraded to Spark 1.2.0 and we're seeing this in the UI.





What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.


Re: Strange DAG scheduling behavior on concurrently dependent RDDs

2015-01-07 Thread Corey Nolet
I asked this question too soon. I am caching a bunch of RDDs in a
TrieMap so that our framework can wire them together, and the locking was
not completely correct- at times it was creating multiple new RDDs
instead of reusing the cached versions, which produced completely
separate lineages.

What's strange is that this bug only surfaced when I updated Spark.

On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet cjno...@gmail.com wrote:

 We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
 that we've been developing that connects various different RDDs together
 based on some predefined business cases. After updating to 1.2.0, some of
 the concurrency expectations about how the stages within jobs are executed
 have changed quite significantly.

 Given 3 RDDs:

 RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
 RDD2 = RDD1.outputToFile
 RDD3 = RDD1.groupBy().outputToFile

 In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage
 encountered (RDD2's outputToFile or RDD3's groupBy()) and then for RDD2 and
 RDD3 to both block waiting for RDD1 to complete and cache- at which point
 RDD2 and RDD3 both use the cached version to complete their work.

 Spark 1.2.0 seems to schedule two (albeit concurrently running) stages for
 each of RDD1's stages (inputFromHDFS, groupBy(), sortBy(), etc() will each
 get run twice). It does not look like there is any sharing of the results
 between these jobs.

 Are we doing something wrong? Is there a setting that I'm not
 understanding somewhere?



Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2015-01-06 Thread Corey Nolet


 On Jan. 5, 2015, 9:09 p.m., Christopher Tubbs wrote:
  core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java,
   line 63
  https://reviews.apache.org/r/29502/diff/2/?file=804705#file804705line63
 
  Should this be Authorizations.EMPTY? Or should it have a default 
  implementation on WrappingIterator which calls source.getAuthorizations()?
 
 Christopher Tubbs wrote:
 make that `getSource().getAuthorizations()`

Specific to this test I returned null because all the other getters (other than 
what was being explicitly tested) were returning null. Were you thinking 
WrappingIterator should also provide a getAuthorizations() method?


 On Jan. 5, 2015, 9:09 p.m., Christopher Tubbs wrote:
  server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java,
   line 46
  https://reviews.apache.org/r/29502/diff/2/?file=804711#file804711line46
 
  I wonder if there's a better way to provide environment options, like 
  this and others, at specific scopes. Maybe use some dependency injection, 
  with annotations, like Servlet @Context or JUnit @Rule: @ScanContext 
  Authorizations auths; (throw error if type is not appropriate for context 
  during injection).

This feature would be pretty neat. Were you thinking this would extend past 
just the IteratorEnvironment into other places? Any other fields you can think 
of that would benefit from this change other than Authorizations?


- Corey


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/#review66725
---


On Dec. 31, 2014, 3:40 p.m., Corey Nolet wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29502/
 ---
 
 (Updated Dec. 31, 2014, 3:40 p.m.)
 
 
 Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
 kturner.
 
 
 Bugs: ACCUMULO-3458
 https://issues.apache.org/jira/browse/ACCUMULO-3458
 
 
 Repository: accumulo
 
 
 Description
 ---
 
 ACCUMULO-3458 Propagating scan-time authorizations through the 
 IteratorEnvironment so that scan-time iterators can use them.
 
 
 Diffs
 -
 
   
 core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
  4903656 
   core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a 
   core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
 2552682 
   core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 
 666a8af 
   core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 
 9726266 
   
 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java
  2a79f05 
   
 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
 72cb863 
   
 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java
  9e20cb1 
   
 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java
  15c33fa 
   
 core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
  94da7b5 
   
 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
  fa46360 
   
 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
  4521e55 
   
 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
  4cebab7 
   
 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
  4a45e99 
   
 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
  a9801b0 
   
 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java
  bf35557 
   
 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
  d1fece5 
   
 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java
  869cc33 
   
 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
  fe4b16b 
   test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
 PRE-CREATION 
   test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java 
 PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/29502/diff/
 
 
 Testing
 ---
 
 Wrote an integration test to verify that ScanDataSource is actually setting 
 the authorizations on the IteratorEnvironment
 
 
 Thanks,
 
 Corey Nolet
 




Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2015-01-06 Thread Corey Nolet

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/
---

(Updated Jan. 6, 2015, 3:44 p.m.)


Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
kturner.


Changes
---

Fixed based on feedback from Christopher and Keith. Noticed some extra
formatting changes that removed whitespace in some places.


Bugs: ACCUMULO-3458
https://issues.apache.org/jira/browse/ACCUMULO-3458


Repository: accumulo


Description
---

ACCUMULO-3458 Propagating scan-time authorizations through the 
IteratorEnvironment so that scan-time iterators can use them.


Diffs (updated)
-

  core/src/main/java/org/apache/accumulo/core/client/BatchDeleter.java 2bfc347 
  
core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
 4903656 
  core/src/main/java/org/apache/accumulo/core/client/Scanner.java 112179e 
  core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a 
  core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
2552682 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 
666a8af 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerIterator.java 
1e0ac99 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 
9726266 
  
core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java
 2a79f05 
  core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
72cb863 
  
core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 
9e20cb1 
  core/src/main/java/org/apache/accumulo/core/iterators/WrappingIterator.java 
060fa76 
  
core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java
 15c33fa 
  core/src/test/java/org/apache/accumulo/core/client/impl/ScannerImplTest.java 
be4d467 
  
core/src/test/java/org/apache/accumulo/core/client/impl/TabletServerBatchReaderTest.java
 PRE-CREATION 
  
core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
 94da7b5 
  
core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
 fa46360 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
 4521e55 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
 4cebab7 
  
server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
 4a45e99 
  
server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
 a9801b0 
  
server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java
 bf35557 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
 d1fece5 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 
869cc33 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
 fe4b16b 
  test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
PRE-CREATION 
  test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION 

Diff: https://reviews.apache.org/r/29502/diff/


Testing
---

Wrote an integration test to verify that ScanDataSource is actually setting the 
authorizations on the IteratorEnvironment


Thanks,

Corey Nolet



Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2015-01-06 Thread Corey Nolet

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/
---

(Updated Jan. 6, 2015, 3:54 p.m.)


Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
kturner.


Changes
---

Removed files that were only reformatted and not changed in any other way to
support the feature in this commit.


Bugs: ACCUMULO-3458
https://issues.apache.org/jira/browse/ACCUMULO-3458


Repository: accumulo


Description
---

ACCUMULO-3458 Propagating scan-time authorizations through the 
IteratorEnvironment so that scan-time iterators can use them.


Diffs (updated)
-

  
core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
 4903656 
  core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a 
  core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
2552682 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 
666a8af 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 
9726266 
  
core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java
 2a79f05 
  core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
72cb863 
  
core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 
9e20cb1 
  core/src/test/java/org/apache/accumulo/core/client/impl/ScannerImplTest.java 
be4d467 
  
core/src/test/java/org/apache/accumulo/core/client/impl/TabletServerBatchReaderTest.java
 PRE-CREATION 
  
core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
 94da7b5 
  
core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
 fa46360 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
 4521e55 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
 4cebab7 
  
server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
 4a45e99 
  
server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
 a9801b0 
  
server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java
 bf35557 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
 d1fece5 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 
869cc33 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
 fe4b16b 
  test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
PRE-CREATION 
  test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION 

Diff: https://reviews.apache.org/r/29502/diff/


Testing
---

Wrote an integration test to verify that ScanDataSource is actually setting the 
authorizations on the IteratorEnvironment


Thanks,

Corey Nolet



Re: Write and Read file through map reduce

2015-01-05 Thread Corey Nolet
Hitarth,

I don't know how much direction you are looking for with regard to the
formats of the files, but you can certainly read both files into the third
MapReduce job using the FileInputFormat by comma-separating the paths to
the files. The blocks for both files will essentially be unioned together
and the mappers scheduled across your cluster.
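
A minimal sketch of that approach, assuming file1 and file2 were written to
the hypothetical paths /data/file1 and /data/file2 (the mapper/reducer setup
is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ThirdJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "third-job");
    job.setJarByClass(ThirdJobDriver.class);

    // Both inputs in one call, comma-separated; their blocks are unioned
    // into the job's input splits.
    FileInputFormat.addInputPaths(job, "/data/file1,/data/file2");
    // ...or equivalently, one at a time:
    // FileInputFormat.addInputPath(job, new Path("/data/file1"));
    // FileInputFormat.addInputPath(job, new Path("/data/file2"));

    FileOutputFormat.setOutputPath(job, new Path("/data/third-job-output"));
    // set the mapper/reducer and key/value classes for the actual computation here

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}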

On Mon, Jan 5, 2015 at 3:55 PM, hitarth trivedi t.hita...@gmail.com wrote:

 Hi,

 I have 6 node cluster, and the scenario is as follows :-

 I have one map reduce job which will write file1 in HDFS.
 I have another map reduce job which will write file2 in  HDFS.
 In the third map reduce job I need to use file1 and file2 to do some
 computation and output the value.

 What is the best way to store file1 and file2 in HDFS so that they could
 be used in third map reduce job.

 Thanks,
 Hitarth



Re: Submitting spark jobs through yarn-client

2015-01-03 Thread Corey Nolet
Took me just about all night (it's 3am here in EST) but I finally figured
out how to get this working. I pushed up my example code for others who may
be struggling with this same problem. It really took an understanding of
how the classpath needs to be configured both in YARN and in the client
driver application.

Here's the example code on github:
https://github.com/cjnolet/spark-jetty-server
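
For anyone hitting the same wall, a hedged sketch of the kind of driver-side
configuration this boils down to (this is not the code from the repo above;
the jar locations and the spark.yarn.jar value are assumptions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EmbeddedSparkContextFactory {
  public static JavaSparkContext create() {
    SparkConf conf = new SparkConf()
        .setMaster("yarn-client")
        .setAppName("jetty-embedded-driver")
        // Point YARN at the Spark assembly so ExecutorLauncher can find core
        // classes like org.apache.spark.Logging (the NoClassDefFoundError below).
        .set("spark.yarn.jar", "hdfs:///libs/spark-assembly-1.2.0-hadoop2.4.0.jar")
        // Ship the web app's own job classes to the executors.
        .setJars(new String[] { "/opt/webapp/lib/my-webapp-jobs.jar" });
    return new JavaSparkContext(conf);
  }
}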

On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote:

 So looking @ the actual code- I see where it looks like --class 'notused'
 --jar null is set on the ClientBase.scala when yarn is being run in client
 mode. One thing I noticed is that the jar is being set by trying to grab
 the jar's uri from the classpath resources- in this case I think it's
 finding the spark-yarn jar instead of spark-assembly so when it tries to
 run the ExecutorLauncher.scala, none of the core classes (like
 org.apache.spark.Logging) are going to be available on the classpath.

 I hope this is the root of the issue. I'll keep this thread updated with
 my findings.

 On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote:

 .. and looking even further, it looks like the actual command that's
 executed starting up the JVM to run the
 org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class
 'notused' --jar null.

 I would assume this isn't expected but I don't see where to set these
 properties or why they aren't making it through.

 On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote:

 Looking a little closer @ the launch_container.sh file, it appears to be
 adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
 the directory pointed to by PWD. Any ideas?

 On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is
 being submitted through yarn-client. I'm trying two different approaches
 and both seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a SparkContext directly, manually adding all the lib
 jars from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit.main() to pass the classname and jar
 containing my code.


 When yarn tries to create the container, I get an exception in the
 driver: "Yarn application already ended, might be killed or not able to
 launch application master." When I look into the logs for the nodemanager,
 I see NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.







Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
Looking a little closer @ the launch_container.sh file, it appears to be
adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
the directory pointed to by PWD. Any ideas?

On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is being
 submitted through yarn-client. I'm trying two different approaches and both
 seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a SparkContext directly, manually adding all the lib jars
 from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit.main() to pass the classname and jar containing
 my code.


 When yarn tries to create the container, I get an exception in the driver:
 "Yarn application already ended, might be killed or not able to launch
 application master." When I look into the logs for the nodemanager, I see
 NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.



Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
I'm trying to get a SparkContext going in a web container which is being
submitted through yarn-client. I'm trying two different approaches and both
seem to be resulting in the same error from the yarn nodemanagers:

1) I'm newing up a SparkContext directly, manually adding all the lib jars
from Spark and Hadoop to the setJars() method on the SparkConf.

2) I'm using SparkSubmit.main() to pass the classname and jar containing my
code.


When yarn tries to create the container, I get an exception in the driver:
"Yarn application already ended, might be killed or not able to launch
application master." When I look into the logs for the nodemanager, I see
NoClassDefFoundError: org/apache/spark/Logging.

Looking closer @ the contents of the nodemanagers, I see that the spark
yarn jar was renamed to __spark__.jar and placed in the app cache while the
rest of the libraries I specified via setJars() were all placed in the file
cache. Any ideas as to what may be happening? I even tried adding the
spark-core dependency and uber-jarring my own classes so that the
dependencies would be there when Yarn tries to create the container.


Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
So looking @ the actual code- I see where it looks like --class 'notused'
--jar null is set on the ClientBase.scala when yarn is being run in client
mode. One thing I noticed is that the jar is being set by trying to grab
the jar's uri from the classpath resources- in this case I think it's
finding the spark-yarn jar instead of spark-assembly so when it tries to
run the ExecutorLauncher.scala, none of the core classes (like
org.apache.spark.Logging) are going to be available on the classpath.

I hope this is the root of the issue. I'll keep this thread updated with my
findings.

On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote:

 .. and looking even further, it looks like the actual command that's
 executed starting up the JVM to run the
 org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class
 'notused' --jar null.

 I would assume this isn't expected but I don't see where to set these
 properties or why they aren't making it through.

 On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote:

 Looking a little closer @ the launch_container.sh file, it appears to be
 adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
 the directory pointed to by PWD. Any ideas?

 On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is being
 submitted through yarn-client. I'm trying two different approaches and both
 seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a SparkContext directly, manually adding all the lib
 jars from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit.main() to pass the classname and jar containing
 my code.


 When yarn tries to create the container, I get an exception in the
 driver: "Yarn application already ended, might be killed or not able to
 launch application master." When I look into the logs for the nodemanager,
 I see NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.






Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
.. and looking even further, it looks like the actual command that's
executed starting up the JVM to run the
org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class
'notused' --jar null.

I would assume this isn't expected but I don't see where to set these
properties or why they aren't making it through.

On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote:

 Looking a little closer @ the launch_container.sh file, it appears to be
 adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in
 the directory pointed to by PWD. Any ideas?

 On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to get a SparkContext going in a web container which is being
 submitted through yarn-client. I'm trying two different approaches and both
 seem to be resulting in the same error from the yarn nodemanagers:

 1) I'm newing up a SparkContext directly, manually adding all the lib jars
 from Spark and Hadoop to the setJars() method on the SparkConf.

 2) I'm using SparkSubmit.main() to pass the classname and jar containing
 my code.


 When yarn tries to create the container, I get an exception in the driver:
 "Yarn application already ended, might be killed or not able to launch
 application master." When I look into the logs for the nodemanager, I see
 NoClassDefFoundError: org/apache/spark/Logging.

 Looking closer @ the contents of the nodemanagers, I see that the spark
 yarn jar was renamed to __spark__.jar and placed in the app cache while the
 rest of the libraries I specified via setJars() were all placed in the file
 cache. Any ideas as to what may be happening? I even tried adding the
 spark-core dependency and uber-jarring my own classes so that the
 dependencies would be there when Yarn tries to create the container.





Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2014-12-31 Thread Corey Nolet

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/
---

(Updated Dec. 31, 2014, 1:46 p.m.)


Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
kturner.


Repository: accumulo


Description
---

ACCUMULO-3458 Propagating scan-time authorizations through the 
IteratorEnvironment so that scan-time iterators can use them.


Diffs (updated)
-

  
core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
 4903656 
  core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a 
  core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
2552682 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 
666a8af 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 
9726266 
  
core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java
 2a79f05 
  core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
72cb863 
  
core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 
9e20cb1 
  
core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java
 15c33fa 
  
core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
 94da7b5 
  
core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
 fa46360 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
 4521e55 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
 4cebab7 
  
server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
 4a45e99 
  
server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
 a9801b0 
  
server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java
 bf35557 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
 d1fece5 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 
869cc33 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
 fe4b16b 
  test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
PRE-CREATION 
  test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION 

Diff: https://reviews.apache.org/r/29502/diff/


Testing
---

Wrote an integration test to verify that ScanDataSource is actually setting the 
authorizations on the IteratorEnvironment


Thanks,

Corey Nolet



Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2014-12-31 Thread Corey Nolet


 On Dec. 31, 2014, 4:30 a.m., Josh Elser wrote:
  core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java,
   line 197
  https://reviews.apache.org/r/29502/diff/1/?file=804415#file804415line197
 
  Can't you pull this from the Scanner?

I didn't see a good way to get this info from the scanner. The more I think 
about this- a simple getter on the scanner would be massively useful.


 On Dec. 31, 2014, 4:30 a.m., Josh Elser wrote:
  server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java,
   line 55
  https://reviews.apache.org/r/29502/diff/1/?file=804426#file804426line55
 
  It looks like TabletIteratorEnvironment is used for minor compactions. 
  Isn't always setting `Authorizations.EMPTY` a little misleading? Is there 
  something more representative of having all auths we could do here? Maybe 
  extra documentation is enough? Could also throw 
  UnsupportedOperationException or similar when the IteratorScope is 
  something that isn't SCAN?

Good point! This should definitely be documented as a scan-time only operation. 
I'm on the fence about throwing an exception- I think I could go either way on 
that.


 On Dec. 31, 2014, 4:30 a.m., Josh Elser wrote:
  test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java, line 54
  https://reviews.apache.org/r/29502/diff/1/?file=804430#file804430line54
 
  Please create a user, assign it the auths you need, and then remove the 
  user after the test.
  
  If this test is run against a standalone instance, it should try to 
  leave the system in the same state the test started in.

You know I was thinking about this when I was coding the test and totally 
forgot to change it before I created the patch.
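
For the record, a hedged sketch of the fixture shape being asked for (not the
actual test code; the user name, password, and auths are assumptions):

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.security.Authorizations;

public class ScanAuthsTestUserFixture {
  static final String TEST_USER = "scan_auths_test_user";   // hypothetical

  static void setUp(Connector rootConn) throws Exception {
    // Create a dedicated user and grant it the auths the test needs.
    rootConn.securityOperations().createLocalUser(TEST_USER, new PasswordToken("testpass"));
    rootConn.securityOperations().changeUserAuthorizations(TEST_USER, new Authorizations("A", "B"));
  }

  static void tearDown(Connector rootConn) throws Exception {
    // Leave a standalone instance in the state the test found it.
    rootConn.securityOperations().dropLocalUser(TEST_USER);
  }
}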


- Corey


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/#review66439
---


On Dec. 31, 2014, 1:46 p.m., Corey Nolet wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29502/
 ---
 
 (Updated Dec. 31, 2014, 1:46 p.m.)
 
 
 Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
 kturner.
 
 
 Repository: accumulo
 
 
 Description
 ---
 
 ACCUMULO-3458 Propagating scan-time authorizations through the 
 IteratorEnvironment so that scan-time iterators can use them.
 
 
 Diffs
 -
 
   
 core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
  4903656 
   core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a 
   core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
 2552682 
   core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 
 666a8af 
   core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 
 9726266 
   
 core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java
  2a79f05 
   
 core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
 72cb863 
   
 core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java
  9e20cb1 
   
 core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java
  15c33fa 
   
 core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
  94da7b5 
   
 core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
  fa46360 
   
 core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
  4521e55 
   
 core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
  4cebab7 
   
 server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
  4a45e99 
   
 server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
  a9801b0 
   
 server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java
  bf35557 
   
 server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
  d1fece5 
   
 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java
  869cc33 
   
 server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
  fe4b16b 
   test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
 PRE-CREATION 
   test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java 
 PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/29502/diff/
 
 
 Testing
 ---
 
 Wrote an integration test to verify that ScanDataSource is actually setting 
 the authorizations on the IteratorEnvironment
 
 
 Thanks,
 
 Corey Nolet
 




Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2014-12-31 Thread Corey Nolet

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/
---

(Updated Dec. 31, 2014, 10:40 a.m.)


Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
kturner.


Bugs: ACCUMULO-3458
https://issues.apache.org/jira/browse/ACCUMULO-3458


Repository: accumulo


Description
---

ACCUMULO-3458 Propagating scan-time authorizations through the 
IteratorEnvironment so that scan-time iterators can use them.


Diffs
-

  
core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
 4903656 
  core/src/main/java/org/apache/accumulo/core/client/ScannerBase.java 335b63a 
  core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
2552682 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerImpl.java 
666a8af 
  core/src/main/java/org/apache/accumulo/core/client/impl/ScannerOptions.java 
9726266 
  
core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchReader.java
 2a79f05 
  core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
72cb863 
  
core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 
9e20cb1 
  
core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java
 15c33fa 
  
core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
 94da7b5 
  
core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
 fa46360 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
 4521e55 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
 4cebab7 
  
server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
 4a45e99 
  
server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
 a9801b0 
  
server/monitor/src/main/java/org/apache/accumulo/monitor/servlets/trace/NullScanner.java
 bf35557 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
 d1fece5 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 
869cc33 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
 fe4b16b 
  test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
PRE-CREATION 
  test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION 

Diff: https://reviews.apache.org/r/29502/diff/


Testing
---

Wrote an integration test to verify that ScanDataSource is actually setting the 
authorizations on the IteratorEnvironment


Thanks,

Corey Nolet



Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2014-12-30 Thread Corey Nolet

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29502/
---

(Updated Dec. 31, 2014, 4:05 a.m.)


Review request for accumulo, Christopher Tubbs, Eric Newton, Josh Elser, and 
kturner.


Changes
---

Added accumulo group to review.


Repository: accumulo


Description
---

ACCUMULO-3458 Propagating scan-time authorizations through the 
IteratorEnvironment so that scan-time iterators can use them.


Diffs
-

  
core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java
 4903656 
  core/src/main/java/org/apache/accumulo/core/client/impl/OfflineScanner.java 
2552682 
  core/src/main/java/org/apache/accumulo/core/client/mock/MockScannerBase.java 
72cb863 
  
core/src/main/java/org/apache/accumulo/core/iterators/IteratorEnvironment.java 
9e20cb1 
  
core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java
 15c33fa 
  
core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java
 94da7b5 
  
core/src/test/java/org/apache/accumulo/core/iterators/FirstEntryInRowIteratorTest.java
 fa46360 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java
 4521e55 
  
core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
 4cebab7 
  
server/base/src/test/java/org/apache/accumulo/server/iterators/MetadataBulkLoadFilterTest.java
 4a45e99 
  
server/base/src/test/java/org/apache/accumulo/server/replication/StatusCombinerTest.java
 a9801b0 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java
 d1fece5 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java 
869cc33 
  
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/ScanDataSource.java
 fe4b16b 
  test/src/main/java/org/apache/accumulo/test/functional/AuthsIterator.java 
PRE-CREATION 
  test/src/test/java/org/apache/accumulo/test/ScanIteratorIT.java PRE-CREATION 

Diff: https://reviews.apache.org/r/29502/diff/


Testing
---

Wrote an integration test to verify that ScanDataSource is actually setting the 
authorizations on the IteratorEnvironment


Thanks,

Corey Nolet



Submit spark jobs inside web application

2014-12-29 Thread Corey Nolet
I want to have a SparkContext inside of a web application running in Jetty
that I can use to submit jobs to a cluster of Spark executors. I am running
YARN.

Ultimately, I would love it if I could just use something like
SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp
is deployed and de-allocate them when the webapp goes down. The problem is,
SparkSubmit, just like the spark-submit script, requires that I have a JAR
that can be deployed to all the nodes right away. Perhaps I'm not
understanding enough about how Spark works internally, but I thought it was
similar to the Scala shell where closures can be shipped off to executors
at runtime.

I was wondering if another approach would be to start up an app in
yarn-client mode that does nothing and then have my web-application connect
to that master.

Any ideas?


Thanks.


How to tell if RDD no longer has any children

2014-12-29 Thread Corey Nolet
Let's say I have an RDD which gets cached and has two children which do
something with it:

val rdd1 = ...cache()

rdd1.saveAsSequenceFile()

rdd1.groupBy()..saveAsSequenceFile()

If I were to submit both calls to saveAsSequenceFile() in separate threads to
take advantage of concurrency (where possible), what's the best way to
determine when rdd1 is no longer being used by anything?

I'm hoping the best way is not to do reference counting in the futures that
are running the saveAsSequenceFile().
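
A minimal sketch of the simplest alternative (just an illustration, not an
established answer): wait on both futures, then unpersist. saveAsTextFile
stands in for the saveAsSequenceFile() calls above, and the output paths are
placeholders.

import org.apache.spark.api.java.JavaPairRDD;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class UnpersistAfterChildren {
  public static void run(JavaPairRDD<String, String> rdd1) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<?> save1 = pool.submit(() -> rdd1.saveAsTextFile("/out/raw"));
    Future<?> save2 = pool.submit(() -> rdd1.groupByKey().saveAsTextFile("/out/grouped"));
    save1.get();        // block until both children have finished...
    save2.get();
    rdd1.unpersist();   // ...then release the cached partitions
    pool.shutdown();
  }
}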


Cached RDD

2014-12-29 Thread Corey Nolet
If I have 2 RDDs which depend on the same RDD like the following:

val rdd1 = ...

val rdd2 = rdd1.groupBy()...

val rdd3 = rdd1.groupBy()...


If I don't cache rdd1, will its lineage be calculated twice (once for rdd2
and once for rdd3)?


Re: When will spark 1.2 released?

2014-12-19 Thread Corey Nolet
 The dates of the jars were still of Dec 10th.

I figured that was because the jars were staged in Nexus on that date
(before the vote).

On Fri, Dec 19, 2014 at 12:16 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looking at:
 http://search.maven.org/#browse%7C717101892

 The dates of the jars were still of Dec 10th.

 Was I looking at the wrong place ?

 Cheers

 On Thu, Dec 18, 2014 at 11:10 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Yup, as he posted before, "An Apache infrastructure issue prevented me
 from pushing this last night. The issue was resolved today and I should be
 able to push the final release artifacts tonight."

 On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote:

 Patrick is working on the release as we speak -- I expect it'll be out
 later tonight (US west coast) or tomorrow at the latest.

 On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com wrote:

 Interesting, the maven artifacts were dated Dec 10th.
 However vote for RC2 closed recently:

 http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+

 Cheers

 On Dec 18, 2014, at 10:02 PM, madhu phatak phatak@gmail.com wrote:

 It’s on Maven Central already
 http://search.maven.org/#browse%7C717101892

 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com 
 vboylin1...@gmail.com wrote:

 Hi,
 Does anyone know when Spark 1.2 will be released? 1.2 has many great
 features that we can't wait for ;-)

 Sincerely
 Lin wukang


 Sent from NetEase Mail Master



 --
 Regards,
 Madhukara Phatak
 http://www.madhukaraphatak.com





Re: JIRA Tickets for 1.6.2 Release

2014-12-18 Thread Corey Nolet
 Have you started tracking a CHANGES list yet (do we need to update
anything added back in 1.6.2)?

I did start a CHANGES file in the 1.6.2-SNAPSHOT branch. I figure after the
tickets settle down I'll just create a new one.

On Thu, Dec 18, 2014 at 2:05 PM, Christopher ctubb...@apache.org wrote:

 I triage'd some of the issues, deferring to 1.7 if they were marked with a
 fixVersion of 1.5.x or 1.6.x. I left documentation issues alone, as well as
 tests-related improvements and tasks. I commented on a few which looked
 like they were general internal improvements that weren't necessarily bugs.
 Feel free to change them to bugs if I make an incorrect choice on those.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Thu, Dec 18, 2014 at 1:06 PM, Josh Elser josh.el...@gmail.com wrote:
 
  Thanks for starting this up, Corey. Have you started tracking a CHANGES
  list yet (do we need to update anything added back in 1.6.2)?
 
  Oof, good point re semver. Let's coordinate on triaging the tickets as
  there are quite a few. On IRC? I don't want multiple people to spend time
  looking at the same issues :)
 
 
  Christopher wrote:
 
  Because we've agreed on Semver for release versioning, all the JIRAs marked
  for 1.6.x as something other than Bug (or maybe Task, and Test) should
  probably have 1.6.x dropped from their fixVersion.

  They can/should get addressed in 1.7 and later. Those currently marked for
  1.6.x need to be triage'd to determine if they've been labeled correctly,
  though.

  It's not that we can't improve internals in a patch release with Semver (so
  long as we don't alter the API)... but Semver helps focus changes to patch
  releases on things that fix buggy behavior.

  I'll do some triage later today (after some sleep) if others haven't gotten
  to it first.
 
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
  On Thu, Dec 18, 2014 at 12:44 AM, Corey Nolet cjno...@gmail.com
 wrote:
 
  Since we've been discussing cutting an rc0 for testing before we begin the
  formal release process, I've moved over all the non-blocker tickets from
  1.6.2 to 1.6.3 [1]. Many of the tickets that moved haven't been updated
  since the 1.6.1 release. If there are tickets you feel are necessary for
  1.6.2, feel free to move them back and mark them as a blocker [2]. I'd like
  to get an rc0 out very soon- possibly in the next couple of days.
 
  [1]
  https://issues.apache.org/jira/issues/?jql=project%20%3D%20ACCUMULO%20AND%20fixVersion%20%3D%201.6.3

  [2]
  https://issues.apache.org/jira/issues/?jql=project%20%3D%20Accumulo%20and%20priority%20%3D%20Blocker%20and%20fixVersion%20%3D%201.6.2%20and%20status%20%3D%20Open
 
 
 



Re: 1.6.2 candidates

2014-12-17 Thread Corey Nolet
I'll cut one tonight

On Wed, Dec 17, 2014 at 1:52 PM, Christopher ctubb...@apache.org wrote:

 I think we could probably put together a non-voting RC0 to start testing
 with.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Tue, Dec 16, 2014 at 11:28 PM, Eric Newton eric.new...@gmail.com
 wrote:
 
  We are running 1.6.1 w/patches in production already. I would much rather
  have a 1.6.2 official release.
 
  I may have temporary access to a small cluster (3-ish racks) to run some of
  the long running tests on bare metal.
 
  Testing sooner, rather than later is preferable.
 
 
 
 
  On Tue, Dec 16, 2014 at 7:18 PM, Corey Nolet cjno...@gmail.com wrote:
  
   I have cycles to spin the RCs- I wouldn't mind finishing the updates (per
   my notes) of the release documentation as well.
  
   On Tue, Dec 16, 2014 at 7:11 PM, Christopher ctubb...@apache.org
  wrote:
   
I think it'd be good to let somebody else exercise the process a bit, but I
can make the RCs if nobody else volunteers. My primary concern is that
people will have time to test.
   
   
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
   
On Tue, Dec 16, 2014 at 6:37 PM, Josh Elser josh.el...@gmail.com
   wrote:

 +1 There are lots of good bug fixes in 1.6.2 already.

 I can make some time to test, document, etc. Are you volunteering to spin
 the RCs as well?


 Christopher wrote:

 I'm thinking we should look at releasing 1.6.2 in January. I'd say sooner,
 but I don't know if people will have time to test if we start putting
 together RCs this week or next.

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


   
  
 



JIRA Tickets for 1.6.2 Release

2014-12-17 Thread Corey Nolet
Since we've been discussing cutting an rc0 for testing before we begin the
formal release process, I've moved over all the non-blocker tickets from
1.6.2 to 1.6.3 [1]. Many of the tickets that moved haven't been updated
since the 1.6.1 release. If there are tickets you feel are necessary for
1.6.2, feel free to move them back and mark them as a blocker [2]. I'd like
to get an rc0 out very soon- possibly in the next couple of days.

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ACCUMULO%20AND%20fixVersion%20%3D%201.6.3

[2]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20Accumulo%20and%20priority%20%3D%20Blocker%20and%20fixVersion%20%3D%201.6.2%20and%20status%20%3D%20Open


build.sh script still being used?

2014-12-17 Thread Corey Nolet
I'm working on updating the "Making a Release" page on our website [1] with
more detailed instructions on the steps involved. The "Create the candidate"
section references the build.sh script, and I'm contemplating just removing
it altogether since it seems, after quick discussions with a few
individuals, that maven is mostly being called directly. I don't want to
remove the script, however, if there are others in the community who still
feel it is necessary.

The commands that are present in the script are going to be well documented
on the page already. Do we need to keep the script around?


[1] http://accumulo.apache.org/releasing.html


Re: 1.6.2 candidates

2014-12-16 Thread Corey Nolet
I have cycles to spin the RCs- I wouldn't mind finishing the updates (per
my notes) of the release documentation as well.

On Tue, Dec 16, 2014 at 7:11 PM, Christopher ctubb...@apache.org wrote:

 I think it'd be good to let somebody else exercise the process a bit, but I
 can make the RCs if nobody else volunteers. My primary concern is that
 people will have time to test.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Tue, Dec 16, 2014 at 6:37 PM, Josh Elser josh.el...@gmail.com wrote:
 
  +1 There are lots of good bug fixes in 1.6.2 already.
 
  I can make some time to test, document, etc. Are you volunteering to spin
  the RCs as well?
 
 
  Christopher wrote:
 
  I'm thinking we should look at releasing 1.6.2 in January. I'd say
 sooner,
  but I don't know if people will have time to test if we start putting
  together RCs this week or next.
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 



Spark eating exceptions in multi-threaded local mode

2014-12-16 Thread Corey Nolet
I've been running a job in local mode using --master local[*] and I've
noticed that, for some reason, exceptions appear to get eaten- as in, I
don't see them. If I debug in my IDE, I'll see that an exception was thrown
as I step through the code, but if I just run the application, it appears
that everything completed even though I know a bunch of my jobs did not
actually ever get run.

The exception is happening in a map stage.

Is there a special way this is supposed to be handled? Am I missing a
property somewhere that allows these to be bubbled up?


Re: accumulo join order count,sum,avg

2014-12-14 Thread Corey Nolet
A good example of the count/sum/average can be found in our StatsCombiner
example [1]. Joins are a complicated one- your implementation of joins will
really depend on your data set and the expected sizes of each side of the
join. You can obviously always resort to joining data together on different
tablets using MapReduce or Spark but you may be able to simulate more
real-time joins if your data allows. Ordering is kind of the same here-
depending on your data, you could use specialized indexes that take
advantage of the Accumulo keys already being sorted.

If you can provide some more detail about your data set, we may be able to
provide more specific examples on how to accomplish this.




[1]  https://accumulo.apache.org/1.6/examples/combiner.html
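
As a starting point, a small sketch of attaching a combiner using the
built-in SummingCombiner (the StatsCombiner in the linked example is
configured the same way; the table name and column here are assumptions):

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.Combiner;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

import java.util.Collections;

public class AttachSumCombiner {
  public static void attach(Connector conn) throws Exception {
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    // Combine only the "stats:count" column; values are encoded as long strings.
    Combiner.setColumns(setting,
        Collections.singletonList(new IteratorSetting.Column("stats", "count")));
    LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
    conn.tableOperations().attachIterator("metrics", setting);  // hypothetical table
  }
}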

On Sun, Dec 14, 2014 at 9:02 PM, panqing...@163.com panqing...@163.com
wrote:

 Accumulo implementation of the join order count sum AVG how to achieve
 this?



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/accumulo-join-order-count-sum-avg-tp12568.html
 Sent from the Developers mailing list archive at Nabble.com.



Re: accumulo Scanner

2014-12-11 Thread Corey Nolet
You're going to want to use WholeRowIterator.decodeRow(entry.getKey(),
entry.getValue()) for that one. You can do:

for (Entry<Key,Value> entry : scanner) {
   for (Entry<Key,Value> actualEntry :
       WholeRowIterator.decodeRow(entry.getKey(), entry.getValue()).entrySet()) {
     // do something with actualEntry
   }
}
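
For completeness, a small self-contained sketch of the whole flow (the table
name and the empty Authorizations here are assumptions, not from this thread):

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;
import org.apache.accumulo.core.security.Authorizations;

import java.util.Map.Entry;
import java.util.SortedMap;

public class DecodeWholeRows {
  public static void scan(Connector conn) throws Exception {
    Scanner scanner = conn.createScanner("mytable", new Authorizations()); // hypothetical table
    scanner.addScanIterator(new IteratorSetting(50, "wholeRow", WholeRowIterator.class));

    for (Entry<Key, Value> row : scanner) {
      // Each entry now holds an encoded row; decode it back into its cells.
      SortedMap<Key, Value> cells = WholeRowIterator.decodeRow(row.getKey(), row.getValue());
      for (Entry<Key, Value> cell : cells.entrySet()) {
        System.out.println(cell.getKey() + " -> " + cell.getValue());
      }
    }
  }
}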

On Thu, Dec 11, 2014 at 10:24 PM, panqing...@163.com panqing...@163.com
wrote:

 I am trying to use the WholeRowIterator to collapse data with the same row
 key into a single line. Now the Value contains the ColumnFamily,
 ColumnQualifier, and value, but how should the contents of that Value be
 parsed?

  for (Entry<Key, Value> entry : scanner) {
    log.info("" + entry.getKey() + "," + entry.getValue());
  }




 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/accumulo-Scanner-tp12506p12552.html
 Sent from the Developers mailing list archive at Nabble.com.



Re: Accumulo Working Day

2014-12-09 Thread Corey Nolet
Also talked a little about Christopher's work on a new API design:
https://github.com/ctubbsii/accumulo/blob/ACCUMULO-2589/

On Tue, Dec 9, 2014 at 11:56 PM, Josh Elser josh.el...@gmail.com wrote:

 Just so you don't think I forgot, there wasn't really much to report
 today. Lots of friendly banter among everyone.

 The notable discussion was likely Don Miner stopping by and the collective
 trying to brainstorm suggestions as to who would be a good candidate for a
 high-profile keynote speaker for Accumulo Summit 2015 :)

 We also talked a little bit about metrics (with the recent support for
 Hadoop metrics2 added) which helped bring some other devs up to speed who
 hadn't looked at what such support really means.

 Let me know if I forgot anything other attendees.


 Josh Elser wrote:

 I'd be happy to. Not too much discussion yet, but if we talk about
 anything that doesn't end up on JIRA or elsewhere, I'll make sure it
 gets posted here.

 - Josh

 Mike Drob wrote:

 For those of us who were unable to attend, can we get a summary of what
 happened? I'd be curious to know if anything particularly novel came
 out of
 this collaboration!

 On Mon, Dec 8, 2014 at 4:06 PM, Jason Pyeron jpye...@pdinc.us wrote:

  If you are meeting near Ft. Meade I would like to drop off thank you
 doughnuts.

 -Jason

  -Original Message-
 From: Keith Turner
 Sent: Wednesday, December 03, 2014 14:00

 Christopher, Eric, Josh, Billie, Mike, and I are meeting on
 Dec 9 to work
 on Accumulo together for the day in Central MD. If you are
 interested in
 joining us, email me directly. We are meeting in a small
 conf room, so
 space is limited.

 Keith

  --
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 - -
 - Jason Pyeron PD Inc. http://www.pdinc.us -
 - Principal Consultant 10 West 24th Street #100 -
 - +1 (443) 269-1555 x333 Baltimore, Maryland 21218 -
 - -
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 This message is copyright PD Inc, subject to license 20080407P00.






Possible typo in the Hadoop Latest Stable Release Page

2014-12-09 Thread Corey Nolet
I'm looking @ this page: http://hadoop.apache.org/docs/stable/

Is it a typo that Hadoop 2.6.0 is based on 2.4.1?

Thanks.


Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Corey Nolet
Reading the documentation a little more closely, I'm using the wrong
terminology. I'm using "stages" to refer to what Spark is calling a "job". I
guess "application" (more than one SparkContext) is what I'm asking about.
On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote:

 I've read in the documentation that RDDs can be run concurrently when
 submitted in separate threads. I'm curious how the scheduler would handle
 propagating these down to the tasks.

 I have 3 RDDs:
 - one RDD which loads some initial data, transforms it and caches it
 - two RDDs which use the cached RDD to provide reports

 I'm trying to figure out how the resources will be scheduled to perform
 these stages if I were to concurrently run the two RDDs that depend on the
 first RDD. Would the two RDDs run sequentially? Will they both run @ the
 same time and be smart about how they are caching?

 Would this be a time when I'd want to use Tachyon instead and run this as
 2 separate physical jobs: one to place the shared data in the RAMDISK and
 one to run the two dependent RDDs concurrently? Or would it even be best in
 that case to run 3 completely separate jobs?

 We're planning on using YARN so there's 2 levels of scheduling going on.
 We're trying to figure out the best way to utilize the resources so that we
 are fully saturating the system and making sure there's constantly work
 being done rather than anything spinning gears waiting on upstream
 processing to occur (in mapreduce, we'd just submit a ton of jobs and have
 them wait in line).



Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when
submitted in separate threads. I'm curious how the scheduler would handle
propagating these down to the tasks.

I have 3 RDDs:
- one RDD which loads some initial data, transforms it and caches it
- two RDDs which use the cached RDD to provide reports

I'm trying to figure out how the resources will be scheduled to perform
these stages if I were to concurrently run the two RDDs that depend on the
first RDD. Would the two RDDs run sequentially? Will they both run @ the
same time and be smart about how they are caching?

Would this be a time when I'd want to use Tachyon instead and run this as 2
separate physical jobs: one to place the shared data in the RAMDISK and one
to run the two dependent RDDs concurrently? Or would it even be best in
that case to run 3 completely separate jobs?

We're planning on using YARN so there's 2 levels of scheduling going on.
We're trying to figure out the best way to utilize the resources so that we
are fully saturating the system and making sure there's constantly work
being done rather than anything spinning gears waiting on upstream
processing to occur (in mapreduce, we'd just submit a ton of jobs and have
them wait in line).


Re: [VOTE] ACCUMULO-3176

2014-12-01 Thread Corey Nolet
+1 in case it wasn't inferred from my previous comments. As Josh stated,
I'm still confused as to how the veto holds technical justification, since the
changes being made aren't removing methods from the public API.

On Mon, Dec 1, 2014 at 3:42 PM, Josh Elser josh.el...@gmail.com wrote:

 I still don't understand what could even be changed to help you retract
 your veto.

 A number of people here have made suggestions about altering the changes
 to the public API WRT to the major version. I think Brian was the most
 recent, but I recall asking the same question on the original JIRA issue
 too.


 Sean Busbey wrote:

 I'm not sure what questions weren't previously answered in my
 explanations,
 could you please restate which ever ones you want clarification on?

 The vote is closed and only has 2 binding +1s. That means it fails under
 consensus rules regardless of my veto, so the issue seems moot.

 On Mon, Dec 1, 2014 at 1:59 PM, Christopherctubb...@apache.org  wrote:

  So, it's been 5 days since last activity here, and there are still some
 questions/requests for response left unanswered regarding the veto. I'd
 really like a response to these questions so we can put this issue to
 rest.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Wed, Nov 26, 2014 at 1:21 PM, Christopherctubb...@apache.org
 wrote:

  On Wed, Nov 26, 2014 at 11:57 AM, Sean Busbeybus...@cloudera.com

 wrote:

 Responses to a few things below.


 On Tue, Nov 25, 2014 at 2:56 PM, Brian Lossbfl...@praxiseng.com

 wrote:

 Aren’t API-breaking changes allowed in 1.7? If this change is ok for

 2.0,

 then what is the technical reason why it is ok for version 2.0 but

 vetoed

 for version 1.7?

  On Nov 25, 2014, at 3:48 PM, Sean Busbeybus...@cloudera.com

 wrote:


 How about if we push this change in the API out to the client

 reworking

 in

 2.0? Everything will break there anyways so users will already have

 to

 deal

 with the change.

 As I previously mentioned, API breaking changes are allowed on major
 revisions. Currently, 1.7 is a major revision (and I have consistently
 argued for it to remain classified as such). That doesn't mean we
 shouldn't
 consider the cost to end users of making said changes.

 There is no way to know that there won't be a 1.8 or later version
 after
 1.7 and before 2.0. We already have consensus to do a sweeping overhaul

 of

 the API for that later release and have had that consensus for quite

 some

 time. Since users will already have to deal with that breakage in 2.0 I
 don't see this improvement as worth making them deal with changes prior

 to

 that.


  So, are you arguing for no more API additions until 2.0? Because,
 that's
 what it sounds like. As is, your general objection to the API seems to
 be
 independent of this change, but reflective of an overall policy for API
 additions. Please address why your argument applies to this specific
 change, and wouldn't to other API additions. Otherwise, this seems to be

 a

 case of special pleading.

 Please address the fact that there is no breakage here, and we can
 ensure
 that there won't be any more removal (except in exceptional

 circumstances)

 of deprecated APIs until 2.0 to ease changes. (I actually think that

 would

 be a very reasonable policy to adopt today.) In addition, I fully expect
 that 2.0 will be fully compatible with 1.7, and will also not introduce

 any

 breakage except removal of things already deprecated in 1.7. If we make
 this change without marking the previous createTable methods as

 deprecated,

 this new API addition AND the previous createTable API will still be
 available in 2.0 (as deprecated), and will not be removed until 3.0.

 You have also previously argued for more intermediate releases between
 major releases. Please explain how you see omitting this API addition is
 compatible with that goal. Please also explain why, if you consider 1.7

 to

 be a major (expected) release, why such an addition would not be
 appropriate, but would be appropriate for a future major release (2.0).


  On Tue, Nov 25, 2014 at 4:18 PM, Christopherctubb...@apache.org

 wrote:

 On Tue, Nov 25, 2014 at 5:07 PM, Bill Havanki

 bhava...@clouderagovt.com

 wrote:

  In my interpretation of Sean's veto, what he says is bad - using the

 ASF

 word here - is not that the change leaves the property update

 unsolved.

 It's that it changes the API without completely solving it. The

 purpose

 of

 the change is not explicitly to alter the API, but it does cause

 that

 to

 happen, and it is that aspect that is bad (with the given

 justification).

 I just want to clarify my reasoning.

 That is my current understanding, as well. Additionally, it seems to

 me

 that the two things that make it bad is that it A) doesn't achieve

 an

 additional purpose (which can be achieved with additional work), and

 that

 B) it deprecates existing methods (which can be avoided). Unless

 there's

 some other reason that 

Re: Can MiniAccumuloCluster reuse directory?

2014-11-30 Thread Corey Nolet
I had a ticket for that a while back and I don't believe it was ever
completed. By default, it wants to dump out new config files for
everything - having it reuse a config file would mean not re-initializing
each time and reusing the same instance id + rfiles.
ACCUMULO-1378 was it, and it looks like it's still open.

https://issues.apache.org/jira/browse/ACCUMULO-1378?filter=-1

On Sun, Nov 30, 2014 at 7:09 PM, Josh Elser josh.el...@gmail.com wrote:

 Not sure if it already can. I believe it would be trivial to do so.


 David Medinets wrote:

 When starting MiniAccumuloCluster, can I point to a directory used by
 a previous MiniAccumuloCluster?

 If not, would it be infeasible to do so?




[jira] [Commented] (ACCUMULO-3371) Allow user to set Zookeeper port in MiniAccumuloCluster

2014-11-29 Thread Corey Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228916#comment-14228916
 ] 

Corey Nolet commented on ACCUMULO-3371:
---

David,

http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/minicluster/MiniAccumuloConfig.html

I remember adding this in the 1.6 line. Is that not what you were looking
for?
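
A quick sketch of what that looks like (assuming the setZooKeeperPort setter in
the 1.6 MiniAccumuloConfig linked above; the directory, password, and port here
are made up):

import java.io.File
import org.apache.accumulo.minicluster.{MiniAccumuloCluster, MiniAccumuloConfig}

// Pin the ZooKeeper port so it can be published from the Docker container.
val config = new MiniAccumuloConfig(new File("/tmp/mac-test"), "rootPassword")
config.setZooKeeperPort(21810)

val mac = new MiniAccumuloCluster(config)
mac.start()
// ... connect using mac.getInstanceName and mac.getZooKeepers ...
mac.stop()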

On Sat, Nov 29, 2014 at 10:32 AM, David Medinets (JIRA) j...@apache.org



 Allow user to set Zookeeper port in MiniAccumuloCluster
 ---

 Key: ACCUMULO-3371
 URL: https://issues.apache.org/jira/browse/ACCUMULO-3371
 Project: Accumulo
  Issue Type: Improvement
  Components: mini
Affects Versions: 1.6.1
 Environment: Ubuntu inside Docker
Reporter: David Medinets
Priority: Minor

 I'm experimenting with Docker to use MiniAccumuloCluster inside containers. 
 However, I ran into an issue that the Zookeeper port is randomly assigned and 
 therefore can't be exposed to the host machine running the Docker container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Bylaws Change - Majority Approval for Code Changes

2014-11-26 Thread Corey Nolet
Jeremy,

The PMC boards in ASF are re

On Wed, Nov 26, 2014 at 1:18 PM, Jeremy Kepner kep...@ll.mit.edu wrote:

 To be effective, most boards need to be small (~5 people) and not involved
 with day-to-day.
 Ideally, if someone says let's bring this to the board for a decision the
 collective response should be no, let's figure out a compromise.

 On Wed, Nov 26, 2014 at 12:26:09PM -0600, Mike Drob wrote:
  Jeremey, FWIW I believe that the PMC is supposed to be that board. In our
  case, it happens to also be the same population as the committers,
 because
  it was suggested that the overlap leads to a healthier community overall.
 
  On Wed, Nov 26, 2014 at 12:02 PM, Jeremy Kepner kep...@ll.mit.edu
 wrote:
 
   -1 (I vote to keep current consensus approach)
  
   An alternative method for resolution would be to setup an
   elected (or appointed) advisory board of a small number of folks whose
   job it is to look out for the long-term health and strategy of
 Accumulo.
   This board could then
   be appealed to on the rare occasions when consensus over important
   long-term issues
   cannot be achieved.  Just the presence of such a board often has the effect
   of encouraging productive compromise amongst participants.
  
  
  
   On Wed, Nov 26, 2014 at 05:33:40PM +, dlmar...@comcast.net wrote:
   
It was suggested in the ACCUMULO-3176 thread that code changes
 should be
   majority approval instead of consensus approval. I'd like to explore
 this
   idea as it might keep the voting email threads less verbose and leave
 the
   discussion and consensus building to the comments in JIRA. Thoughts?
  



Re: Unsubscribe

2014-11-26 Thread Corey Nolet
send an email to user-unsubscr...@hadoop.apache.org to unsubscribe.

On Wed, Nov 26, 2014 at 3:08 PM, Li Chen ahli1...@gmail.com wrote:

 Please unsubscribe me, too.
 Li

 On Wed, Nov 26, 2014 at 3:03 PM, Sufi Nawaz s...@eaiti.com wrote:

 Please suggest how to unsubscribe from this list.

 Thank you,


 *Sufi Nawaz *Application Innovator

 e: s...@eaiti.com */ *w: www.eaiti.com
 o: (571) 306-4683 */ *c: (940) 595-1285





Re: [VOTE] ACCUMULO-3176

2014-11-25 Thread Corey Nolet
 I could understand the veto if the change actually caused one of the
issues mentioned above or the issue that Sean is raising. But it does not.
The eventual consistency of property updates was an issue before this
change and continues to be an issue. This JIRA did not attempt to address
the property update issue.

You said this before I could and I couldn't agree more.

 Everything will break there anyways so users will already have to deal
with the change.

I didn't see any methods removed from the API but I could be missing
something. I just see a new create() method added.


On Tue, Nov 25, 2014 at 3:56 PM, Brian Loss bfl...@praxiseng.com wrote:

 Aren’t API-breaking changes allowed in 1.7? If this change is ok for 2.0,
 then what is the technical reason why it is ok for version 2.0 but vetoed
 for version 1.7?

  On Nov 25, 2014, at 3:48 PM, Sean Busbey bus...@cloudera.com wrote:
 
 
  How about if we push this change in the API out to the client reworking
 in
  2.0? Everything will break there anyways so users will already have to
 deal
  with the change.
 
  --
  Sean




Re: Configuring custom input format

2014-11-25 Thread Corey Nolet
I was wiring up my job in the shell while I was learning Spark/Scala. I'm
getting more comfortable with them both now, so I've been mostly testing
through IntelliJ with mock data as inputs.

I think the problem lies more with Hadoop than Spark, as the Job object seems
to check its state and throw an exception when the toString() method is
called before the Job has actually been submitted.

On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 How are you creating the object in your Scala shell? Maybe you can write a
 function that directly returns the RDD, without assigning the object to a
 temporary variable.

 Matei

 On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote:

 The closer I look at the stack trace in the Scala shell, the more it appears
 to be the call to toString() that is causing the construction of the Job
 object to fail. Is there a way to suppress this output, since it appears to
 be hindering my ability to new up this object?

 On Wed, Nov 5, 2014 at 5:49 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to use a custom input format with
 SparkContext.newAPIHadoopRDD. Creating the new RDD works fine but setting
 up the configuration file via the static methods on input formats that
 require a Hadoop Job object is proving to be difficult.

 Trying to new up my own Job object with the
 SparkContext.hadoopConfiguration is throwing the exception on line 283 of
 this grepcode:


 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job

 Looking in the SparkContext code, I'm seeing that it's newing up Job
 objects just fine using nothing but the configuration. Using
 SparkContext.textFile() appears to be working for me. Any ideas? Has anyone
 else run into this as well? Is it possible to have a method like
 SparkContext.getJob() or something similar?

 Thanks.






[jira] [Commented] (ACCUMULO-1817) Create a monitoring bridge similar to Hadoop's GangliaContext that can allow easy pluggable support

2014-11-25 Thread Corey Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224975#comment-14224975
 ] 

Corey Nolet commented on ACCUMULO-1817:
---

Awesome! Given that people have been using the jmx wrapper for ganglia
metrics, I didn't know if this was still going to be important for others.
I can say it's still something I would like to see.




 Create a monitoring bridge similar to Hadoop's GangliaContext that can allow 
 easy pluggable support
 ---

 Key: ACCUMULO-1817
 URL: https://issues.apache.org/jira/browse/ACCUMULO-1817
 Project: Accumulo
  Issue Type: New Feature
Reporter: Corey J. Nolet
Assignee: Billie Rinaldi
  Labels: proposed
 Fix For: 1.7.0


 We currently expose JMX and it's possible (with external code) to bridge the 
 JMX to solutions like Ganglia. It would be ideal if the integration were 
 native and pluggable.
 Turns out that Hadoop (hdfs, mapred) and HBase have direct metrics reporting
 to Ganglia through some nice code provided in Hadoop.
 Look into the GangliaContext to see if we can implement Ganglia metrics 
 reporting by Accumulo configuration alone.
 References: http://wiki.apache.org/hadoop/GangliaMetrics, 
 http://hbase.apache.org/metrics.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Job object toString() is throwing an exception

2014-11-25 Thread Corey Nolet
I was playing around in the Spark shell and newing up an instance of Job
that I could use to configure the inputformat for a job. By default, the
Scala shell println's the result of every command typed. It throws an
exception when it printlns the newly created instance of Job because it
looks like it's setting a state upon allocation and it's not happy with the
state that it's in when toString() is called before the job is submitted.

I'm using Hadoop 2.5.1. I don't see any tickets for this for 2.6. Has
anyone else ran into this?
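
One possible workaround (a sketch, assuming the standard Scala REPL :silent
command is available in the Spark shell) is to toggle result printing off
while constructing the Job, so the shell never calls toString() on it:

scala> :silent
scala> val job = org.apache.hadoop.mapreduce.Job.getInstance(sc.hadoopConfiguration)
scala> // ... call the input format's static setup methods against job here ...
scala> :silent   // toggle automatic printing back on when done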


Re: Job object toString() is throwing an exception

2014-11-25 Thread Corey Nolet
Here's the stack trace. I was going to file a ticket for this but wanted to
check on the user list first to make sure there wasn't already a fix in the
works. It has to do with the Scala shell doing a toString() each time a
command is typed in. The exception stops the instance of Job from ever
being assigned.


scala> val job = new org.apache.hadoop.mapreduce.Job

warning: there were 1 deprecation warning(s); re-run with -deprecation for
details

java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING

at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)

at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)

at
scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)

at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)

at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)

at .<init>(<console>:10)

at .<clinit>(<console>)

at $print(<console>)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)

at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)

at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)

at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)

at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)

at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)

at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)

at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)

at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)

at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)

at org.apache.spark.repl.Main$.main(Main.scala:31)

at org.apache.spark.repl.Main.main(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



On Tue, Nov 25, 2014 at 9:39 PM, Rohith Sharma K S 
rohithsharm...@huawei.com wrote:

  Could you give error message or stack trace?



 *From:* Corey Nolet [mailto:cjno...@gmail.com]
 *Sent:* 26 November 2014 07:54
 *To:* user@hadoop.apache.org
 *Subject:* Job object toString() is throwing an exception



 I was playing around in the Spark shell and newing up an instance of Job
 that I could use to configure the input format for a job. By default, the
 Scala shell prints the result of every command typed. It throws an
 exception when it prints the newly created instance of Job because it
 looks like the Job sets a state upon allocation, and it's not happy with the
 state that it's in when toString() is called before the job is submitted.



 I'm using Hadoop 2.5.1. I don't see any tickets for this for 2.6. Has
 anyone else run into this?



Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Corey Nolet
I was actually about to post this myself. I have a complex join that could
benefit from something like a GroupComparator vs having to do multiple
groupBy operations. This is probably the wrong thread for a full discussion
on this, but I didn't see a JIRA ticket for this or anything similar. Any
reasons why this would not make sense given Spark's design?
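
As an aside, the kind of secondary sort being referenced can be approximated
with repartitionAndSortWithinPartitions (added, if I recall correctly, in the
1.2 line); a rough sketch with made-up (group, timestamp) keys:

import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair/ordered RDD implicits on older releases

val sc = new SparkContext(new SparkConf().setAppName("secondary-sort-sketch"))

val events = sc.parallelize(Seq(
  (("userA", 3L), "click"), (("userA", 1L), "view"), (("userB", 2L), "click")))

// Partition on the grouping half of the composite key only, so every record
// for a group lands in the same partition.
val byGroup = new Partitioner {
  private val hash = new HashPartitioner(4)
  def numPartitions: Int = hash.numPartitions
  def getPartition(key: Any): Int = key match {
    case (group, _) => hash.getPartition(group)
  }
}

// One shuffle: records are co-located by group and sorted by the full
// (group, timestamp) key within each partition, similar to what a
// GroupComparator buys you in MapReduce.
val sorted = events.repartitionAndSortWithinPartitions(byGroup)
sorted.collect().foreach(println)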

On Thu, Nov 20, 2014 at 9:39 AM, Madhu ma...@madhu.com wrote:

 Thanks Patrick.

 I've been testing some 1.2 features, looks good so far.
 I have some example code that I think will be helpful for certain MR-style
 use cases (secondary sort).
 Can I still add that to the 1.2 documentation, or is that frozen at this
 point?



 -
 --
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-2-0-Release-Preview-Posted-tp9400p9449.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: unsubscribe

2014-11-18 Thread Corey Nolet
Abdul,

Please send an email to user-unsubscr...@spark.apache.org

On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote:





 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Contribute Examples/Exercises

2014-11-14 Thread Corey Nolet
Mike & David,

Are you +1 for contributing the examples or +1 for moving the examples out
into separate repos?

On Fri, Nov 14, 2014 at 12:52 PM, David Medinets david.medin...@gmail.com
wrote:

 +1
 On Nov 14, 2014 11:18 AM, Keith Turner ke...@deenlo.com wrote:

  On Wed, Nov 12, 2014 at 4:52 PM, Corey Nolet cjno...@gmail.com wrote:
 
   Josh,
  
My worry with a contrib module is that, historically, code which
   moves to a contrib is just one step away from the grave.
  
   You do have a good point. My hope was that this could be the beginning
 of
   our changing history so that we could begin to encourage the community
 to
   contribute their own source directly and give them an outlet for doing
  so.
   I understand that's also the intent of hosting open source repos under
  ASF
   to begin with- so I'm partial to either outcome.
  
I think there's precedence for keeping them in core (as Christopher
 had
   mentioned, next to examples/simple) which would benefit people
 externally
   (more how do I do X examples) and internally (keep devs honest about
  how
   our APIs are implemented).
  
   I would think that would just require keeping the repos up to date as
   versions change so they wouldn't get out of date and possibly releasing
   them w/ our other releases.
  
  
   Wherever they end up living, thank you Adam for the contributions!
  
 
  I'll 2nd that.
 
  For the following reasons, I think it might be nice to move existing
  examples out of core into their own git repo(s).
 
   * Examples would be based on released version of Accumulo
   * Examples could easily be built w/o building all of Accumulo
   * As Sean said, this would keep us honest
   * The examples poms would serve as examples more than they do when part
 of
  Accumulo build
   * Less likely to use non public APIs in examples
 
 
  
  
  
   On Wed, Nov 12, 2014 at 2:54 PM, Josh Elser josh.el...@gmail.com
  wrote:
  
My worry with a contrib module is that, historically, code which
moves to a contrib is just one step away from the grave. I think
  there's
precedence for keeping them in core (as Christopher had mentioned,
 next
   to
examples/simple) which would benefit people externally (more how do
 I
  do
X examples) and internally (keep devs honest about how our APIs are
implemented).
   
Bringing the examples into the core also encourages us to grow the
community which has been stagnant with respect to new committers for
   about
9 months now.
   
   
Corey Nolet wrote:
   
+1 for adding the examples to contrib.
   
I was, myself, reading over this email wondering how a set of 11
   separate
examples on the use of Accumulo would fit into the core codebase-
especially as more are contributed over time. I like the idea of
  giving
community members an outlet for contributing examples that they've
  built
so
that we can continue to foster that without having to fit them in
 the
   core
codebase. It just seems more maintainable.
   
   
On Wed, Nov 12, 2014 at 2:19 PM, Josh Elserjosh.el...@gmail.com
   wrote:
   
 I'll take that as you disagree with my consideration of
  substantial.
Thanks.
   
   
Mike Drob wrote:
   
 The proposed contribution is a collection of 11 examples. It's
  clearly
non-trivial, which is probably enough to be considered
 substantial
   
On Wed, Nov 12, 2014 at 12:58 PM, Josh Elserjosh.el...@gmail.com
 
wrote:
   
   
 Sean Busbey wrote:
   
  On Wed, Nov 12, 2014 at 12:31 PM, Josh Elser
  josh.el...@gmail.com
   
wrote:
   
   Personally, I didn't really think that this contribution was
 in
   the
   
 spirit
of what the new codebase adoption guidelines were meant to
 cover.
   
Some extra examples which leverage what Accumulo already does
  seems
more
like improvements for new Accumulo users than anything else.
   
   
   It's content developed out side of the project list. That's
  all
   it
   
 takes to
require the trip through the Incubator checks as far as the ASF
guidelines
are concerned.
   
   
   
   From http://incubator.apache.org/ip-clearance/index.html
   

  From time to time, an external codebase is brought into the ASF
   that
is
not a separate incubating project but still represents a
  substantial
contribution that was not developed within the ASF's source
 control
system
and on our public mailing lists.

   
Not to look a gift-horse in the mouth (it is great work), but I
  don't
see
these examples as substantial. I haven't found guidelines yet
  that
better
clarify the definition of substantial.
   
   
   
   
  
 



Spark Hadoop 2.5.1

2014-11-14 Thread Corey Nolet
I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
the current stable Hadoop 2.x, would it make sense for us to update the
poms?


Re: Spark Hadoop 2.5.1

2014-11-14 Thread Corey Nolet
In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like
you've mentioned. What prompted me to write this email was that I did not
see any documentation that told me Hadoop 2.5.1 was officially supported by
Spark (i.e. community has been using it, any bugs are being fixed, etc...).
It builds, tests pass, etc... but there could be other implications that I
have not run into based on my own use of the framework.

If we are saying that the standard procedure is to build with the
hadoop-2.4 profile and override the -Dhadoop.version property, should we
provide that on the build instructions [1] at least?

[1] http://spark.apache.org/docs/latest/building-with-maven.html
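
For reference, the invocation being described (the hadoop-2.4 profile with the
version property overridden, per the build docs) would be something like:

mvn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package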

On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote:

 I don't think it's necessary. You're looking at the hadoop-2.4
 profile, which works with anything = 2.4. AFAIK there is no further
 specialization needed beyond that. The profile sets hadoop.version to
 2.4.0 by default, but this can be overridden.

 On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet cjno...@gmail.com wrote:
  I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
  the current stable Hadoop 2.x, would it make sense for us to update the
  poms?



Re: Contribute Examples/Exercises

2014-11-12 Thread Corey Nolet
+1 for adding the examples to contrib.

I was, myself, reading over this email wondering how a set of 11 separate
examples on the use of Accumulo would fit into the core codebase-
especially as more are contributed over time. I like the idea of giving
community members an outlet for contributing examples that they've built so
that we can continue to foster that without having to fit them in the core
codebase. It just seems more maintainable.


On Wed, Nov 12, 2014 at 2:19 PM, Josh Elser josh.el...@gmail.com wrote:

 I'll take that as you disagree with my consideration of substantial.
 Thanks.


 Mike Drob wrote:

 The proposed contribution is a collection of 11 examples. It's clearly
 non-trivial, which is probably enough to be considered substantial

 On Wed, Nov 12, 2014 at 12:58 PM, Josh Elserjosh.el...@gmail.com
 wrote:


 Sean Busbey wrote:

  On Wed, Nov 12, 2014 at 12:31 PM, Josh Elserjosh.el...@gmail.com
 wrote:

   Personally, I didn't really think that this contribution was in the

 spirit
 of what the new codebase adoption guidelines were meant to cover.

 Some extra examples which leverage what Accumulo already does seems
 more
 like improvements for new Accumulo users than anything else.


   It's content developed out side of the project list. That's all it

 takes to
 require the trip through the Incubator checks as far as the ASF
 guidelines
 are concerned.



   From http://incubator.apache.org/ip-clearance/index.html

 
  From time to time, an external codebase is brought into the ASF that is
 not a separate incubating project but still represents a substantial
 contribution that was not developed within the ASF's source control
 system
 and on our public mailing lists.
 

 Not to look a gift-horse in the mouth (it is great work), but I don't see
 these examples as substantial. I haven't found guidelines yet that
 better
 clarify the definition of substantial.





Re: Contribute Examples/Exercises

2014-11-12 Thread Corey Nolet
Josh,

 My worry with a contrib module is that, historically, code which
moves to a contrib is just one step away from the grave.

You do have a good point. My hope was that this could be the beginning of
our changing history so that we could begin to encourage the community to
contribute their own source directly and give them an outlet for doing so.
I understand that's also the intent of hosting open source repos under ASF
to begin with- so I'm partial to either outcome.

 I think there's precedence for keeping them in core (as Christopher had
mentioned, next to examples/simple) which would benefit people externally
(more how do I do X examples) and internally (keep devs honest about how
our APIs are implemented).

I would think that would just require keeping the repos up to date as
versions change so they wouldn't get out of date and possibly releasing
them w/ our other releases.


Wherever they end up living, thank you Adam for the contributions!



On Wed, Nov 12, 2014 at 2:54 PM, Josh Elser josh.el...@gmail.com wrote:

 My worry with a contrib module is that, historically, code which
 moves to a contrib is just one step away from the grave. I think there's
 precedence for keeping them in core (as Christopher had mentioned, next to
 examples/simple) which would benefit people externally (more how do I do
 X examples) and internally (keep devs honest about how our APIs are
 implemented).

 Bringing the examples into the core also encourages us to grow the
 community which has been stagnant with respect to new committers for about
 9 months now.


 Corey Nolet wrote:

 +1 for adding the examples to contrib.

 I was, myself, reading over this email wondering how a set of 11 separate
 examples on the use of Accumulo would fit into the core codebase-
 especially as more are contributed over time. I like the idea of giving
 community members an outlet for contributing examples that they've built
 so
 that we can continue to foster that without having to fit them in the core
 codebase. It just seems more maintainable.


 On Wed, Nov 12, 2014 at 2:19 PM, Josh Elserjosh.el...@gmail.com  wrote:

  I'll take that as you disagree with my consideration of substantial.
 Thanks.


 Mike Drob wrote:

  The proposed contribution is a collection of 11 examples. It's clearly
 non-trivial, which is probably enough to be considered substantial

 On Wed, Nov 12, 2014 at 12:58 PM, Josh Elserjosh.el...@gmail.com
 wrote:


  Sean Busbey wrote:

   On Wed, Nov 12, 2014 at 12:31 PM, Josh Elserjosh.el...@gmail.com

 wrote:

Personally, I didn't really think that this contribution was in the

  spirit
 of what the new codebase adoption guidelines were meant to cover.

 Some extra examples which leverage what Accumulo already does seems
 more
 like improvements for new Accumulo users than anything else.


It's content developed out side of the project list. That's all it

  takes to
 require the trip through the Incubator checks as far as the ASF
 guidelines
 are concerned.



From http://incubator.apache.org/ip-clearance/index.html

 
   From time to time, an external codebase is brought into the ASF that
 is
 not a separate incubating project but still represents a substantial
 contribution that was not developed within the ASF's source control
 system
 and on our public mailing lists.
 

 Not to look a gift-horse in the mouth (it is great work), but I don't
 see
 these examples as substantial. I haven't found guidelines yet that
 better
 clarify the definition of substantial.






Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Corey Nolet
I'm loading sequence files containing JSON blobs in the value, transforming
them into RDD[String], and then using hiveContext.jsonRDD(). It looks like
Spark reads the files twice - once when I define the jsonRDD() and then
again when I actually make my call to hiveContext.sql().

Looking @ the code- I see an inferSchema() method which gets called under
the hood. I also see an experimental jsonRDD() method which has a
sampleRatio.

My dataset is extremely large and I've got a lot of processing to do on it -
looping through it twice is really not a luxury I can afford. I also know
that the SQL I am going to be running matches at least some of the
records contained in the files. Would it make sense, or be possible with the
current execution plan design, to bypass inferring the schema for
purposes of speed?

Though I haven't really dug further into the code than the implementations of
the client API methods that I'm calling, I am wondering if there's a way to
process the data without pre-determining the schema. I also
don't have the luxury of giving the full schema ahead of time because I may
want to do a select * from table but may only know 2 or 3 of the actual
JSON keys that are available.

Thanks.
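
For what it's worth, a sketch of the two knobs mentioned above, under the
assumption of the 1.1/1.2-era API (the paths and field names are
placeholders): the experimental samplingRatio overload infers the schema
from a sample, while handing jsonRDD an explicit (possibly partial) schema
skips the inference pass at the cost of only seeing the listed keys.

import org.apache.hadoop.io.Text
import org.apache.spark.sql._                    // StructType, StructField, StringType
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Sequence files with JSON blobs in the value, flattened to RDD[String].
val jsonStrings = sc.sequenceFile("hdfs:///data/blobs", classOf[Text], classOf[Text])
  .map(_._2.toString)

// Option 1: infer the schema from a sample instead of the full dataset.
val sampled = hiveContext.jsonRDD(jsonStrings, 0.01)

// Option 2: skip inference by supplying a schema up front; only the listed
// keys become columns, so "select *" is limited to these fields.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("body", StringType, nullable = true)))
val explicit = hiveContext.jsonRDD(jsonStrings, schema)
explicit.registerTempTable("blobs")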


Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
+1 (non-binding) [for original process proposal]

Greg, the first time I've seen the word ownership on this thread is in
your message. The first time the word lead has appeared in this thread is
in your message as well. I don't think that was the intent. The PMC and
Committers have a responsibility to the community to make sure that their
patches are being reviewed and committed. I don't see in Apache's
recommended bylaws anywhere that says establishing responsibility on paper
for specific areas cannot be taken on by different members of the PMC.
What's been proposed looks, to me, to be an empirical process and it looks
like it has pretty much a consensus from the side able to give binding
votes. I don't think this model establishes any form of ownership over
anything at all. I also don't see in the process proposal where it mentions that
nobody other than the persons responsible for a module can review or commit
code.

In fact, I'll go as far as to say that since Apache is a meritocracy, the
people who have been aligned to the responsibilities probably were aligned
based on some sort of merit, correct? Perhaps we could dig in and find out
for sure... I'm still getting familiar with the Spark community myself.



On Thu, Nov 6, 2014 at 7:28 PM, Patrick Wendell pwend...@gmail.com wrote:

 In fact, if you look at the subversion commiter list, the majority of
 people here have commit access only for particular areas of the
 project:

 http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS

 On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Greg,
 
  Regarding subversion - I think the reference is to partial vs full
  committers here:
  https://subversion.apache.org/docs/community-guide/roles.html
 
  - Patrick
 
  On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote:
  -1 (non-binding)
 
  This is an idea that runs COMPLETELY counter to the Apache Way, and is
  to be severely frowned upon. This creates *unequal* ownership of the
  codebase.
 
  Each Member of the PMC should have *equal* rights to all areas of the
  codebase under their purview. It should not be subjected to others'
  ownership except through the standard mechanisms of reviews and
  if/when absolutely necessary, to vetos.
 
  Apache does not want leads, benevolent dictators or assigned
  maintainers, no matter how you may dress it up with multiple
  maintainers per component. The fact is that this creates an unequal
  level of ownership and responsibility. The Board has shut down
  projects that attempted or allowed for Leads. Just a few months ago,
  there was a problem with somebody calling themself a Lead.
 
  I don't know why you suggest that Apache Subversion does this. We
  absolutely do not. Never have. Never will. The Subversion codebase is
  owned by all of us, and we all care for every line of it. Some people
  know more than others, of course. But any one of us, can change any
  part, without being subjected to a maintainer. Of course, we ask
  people with more knowledge of the component when we feel
  uncomfortable, but we also know when it is safe or not to make a
  specific change. And *always*, our fellow committers can review our
  work and let us know when we've done something wrong.
 
  Equal ownership reduces fiefdoms, enhances a feeling of community and
  project ownership, and creates a more open and inviting project.
 
  So again: -1 on this entire concept. Not good, to be polite.
 
  Regards,
  Greg Stein
  Director, Vice Chairman
  Apache Software Foundation
 
  On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
  Hi all,
 
  I wanted to share a discussion we've been having on the PMC list, as
 well as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed implementing a
 maintainer model for some of these components, similar to other large
 projects.
 
  As background on this, Spark has grown a lot since joining Apache.
 We've had over 80 contributors/month for the past 3 months, which I believe
 makes us the most active project in contributors/month at Apache, as well
 as over 500 patches/month. The codebase has also grown significantly, with
 new libraries for SQL, ML, graphs and more.
 
  In this kind of large project, one common way to scale development is
 to assign maintainers to oversee key components, where each patch to that
 component needs to get sign-off from at least one of its maintainers. Most
 existing large projects do this -- at Apache, some large ones with this
 model are CloudStack (the second-most active project overall), Subversion,
 and Kafka, and other examples include Linux and Python. This is also
 by-and-large how Spark operates today -- most components have a de-facto
 maintainer.
 
  IMO, adopting this model would have two 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
The PMC [1] is responsible for oversight and does not designate partial or full
committers. There are projects where all committers become PMC members and
others where the PMC is reserved for committers with the most merit (and
willingness to take on the responsibility of project oversight, releases,
etc.). The community maintains the codebase through committers. Committers
mentor, roll in patches, and spread the project throughout other communities.

Adding someone's name to a list as a maintainer is not a barrier. With a
community as large as Spark's, and myself not being a committer on this
project, I see it as a welcome opportunity to find a mentor in the areas in
which I'm interested in contributing. We'd expect the list of names to grow
as more volunteers gain more interest, correct? To me, that seems quite
contrary to a barrier.

[1] http://www.apache.org/dev/pmc.html


On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 So I don't understand, Greg, are the partial committers committers, or are
 they not? Spark also has a PMC, but our PMC currently consists of all
 committers (we decided not to have a differentiation when we left the
 incubator). I see the Subversion partial committers listed as committers
 on https://people.apache.org/committers-by-project.html#subversion, so I
 assume they are committers. As far as I can see, CloudStack is similar.

 Matei

  On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote:
 
  Partial committers are people invited to work on a particular area, and
 they do not require sign-off to work on that area. They can get a sign-off
 and commit outside that area. That approach doesn't compare to this
 proposal.
 
  Full committers are PMC members. As each PMC member is responsible for
 *every* line of code, then every PMC member should have complete rights to
 every line of code. Creating disparity flies in the face of a PMC member's
 responsibility. If I am a Spark PMC member, then I have responsibility for
 GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And
 interposing a barrier inhibits my responsibility to ensure GraphX is
 designed, maintained, and delivered to the Public.
 
  Cheers,
  -g
 
  (and yes, I'm aware of COMMITTERS; I've been changing that file for the
 past 12 years :-) )
 
  On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:
  In fact, if you look at the subversion commiter list, the majority of
  people here have commit access only for particular areas of the
  project:
 
  http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS 
 http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS
 
  On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:
   Hey Greg,
  
   Regarding subversion - I think the reference is to partial vs full
   committers here:
   https://subversion.apache.org/docs/community-guide/roles.html 
 https://subversion.apache.org/docs/community-guide/roles.html
  
   - Patrick
  
   On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com mailto:
 gst...@gmail.com wrote:
   -1 (non-binding)
  
   This is an idea that runs COMPLETELY counter to the Apache Way, and is
   to be severely frowned upon. This creates *unequal* ownership of the
   codebase.
  
   Each Member of the PMC should have *equal* rights to all areas of the
   codebase under their purview. It should not be subjected to others'
   ownership except through the standard mechanisms of reviews and
   if/when absolutely necessary, to vetos.
  
   Apache does not want leads, benevolent dictators or assigned
   maintainers, no matter how you may dress it up with multiple
   maintainers per component. The fact is that this creates an unequal
   level of ownership and responsibility. The Board has shut down
   projects that attempted or allowed for Leads. Just a few months ago,
   there was a problem with somebody calling themself a Lead.
  
   I don't know why you suggest that Apache Subversion does this. We
   absolutely do not. Never have. Never will. The Subversion codebase is
   owned by all of us, and we all care for every line of it. Some people
   know more than others, of course. But any one of us, can change any
   part, without being subjected to a maintainer. Of course, we ask
   people with more knowledge of the component when we feel
   uncomfortable, but we also know when it is safe or not to make a
   specific change. And *always*, our fellow committers can review our
   work and let us know when we've done something wrong.
  
   Equal ownership reduces fiefdoms, enhances a feeling of community and
   project ownership, and creates a more open and inviting project.
  
   So again: -1 on this entire concept. Not good, to be polite.
  
   Regards,
   Greg Stein
   Director, Vice Chairman
   Apache Software Foundation
  
   On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
   Hi all,
  
   I wanted to share a 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
I'm actually going to change my non-binding to +0 for the proposal as-is.

I overlooked some parts of the original proposal that, when reading over
them again, do not sit well with me. "One of the maintainers needs to sign
off on each patch to the component", as Greg has pointed out, does seem to
imply that there are committers with more power than others with regard to
specific components, which does imply ownership.

My thinking would be to rework it in some way so as to take the accent off
of ownership. I would maybe focus on things such as:

1) Other committers and contributors being forced to consult with
maintainers of modules before patches can get rolled in.
2) Maintainers being assigned specifically from the PMC.
3) Oversight having more of an accent on keeping the community happy in a
specific area of interest rather than being a consultant for the design of a
specific piece.

On Thu, Nov 6, 2014 at 8:46 PM, Arun C Murthy a...@hortonworks.com wrote:

 With my ASF Member hat on, I fully agree with Greg.

 As he points out, this is an anti-pattern in the ASF and is severely
 frowned upon.

 We, in Hadoop, had a similar trajectory where we were politely told to
 move away from having sub-project committers (HDFS, MapReduce, etc.) to a
 common list of committers. There were some concerns initially, but we have
 successfully managed to work together and build a more healthy community as
 a result of following the advice on the ASF Way.

 I do have sympathy for good oversight etc. as the project grows and
 attracts many contributors - it's essentially the need to have smaller,
 well-knit developer communities. One way to achieve that would be to have
 separate TLPs  (e.g. Spark, MLLIB, GraphX) with separate committer lists
 for each representing the appropriate community. Hadoop went a similar
 route where we had Pig, Hive, HBase etc. as sub-projects initially and then
 split them into TLPs with more focussed communities to the benefit of
 everyone. Maybe you guys want to try this too?

 

 Few more observations:
 # In general, *discussions* on project directions (such as new concept of
 *maintainers*) should happen first on the public lists *before* voting, not
 in the private PMC list.
 # If you chose to go this route in spite of this advice, seems to me Spark
 would be better of having more maintainers per component (at least 4-5),
 probably with a lot more diversity in terms of affiliations. Not sure if
 that is a concern - do you have good diversity in the proposed list? This
 will ensure that there are no concerns about a dominant employer
 controlling a project.

 

 Hope this helps - we've gone through similar journey, got through similar
 issues and fully embraced the Apache Way (™) as Greg points out to our
 benefit.

 thanks,
 Arun


 On Nov 6, 2014, at 4:18 PM, Greg Stein gst...@gmail.com wrote:

  -1 (non-binding)
 
  This is an idea that runs COMPLETELY counter to the Apache Way, and is
   to be severely frowned upon. This creates *unequal* ownership of the
  codebase.
 
  Each Member of the PMC should have *equal* rights to all areas of the
   codebase under their purview. It should not be subjected to others'
   ownership except through the standard mechanisms of reviews and
  if/when absolutely necessary, to vetos.
 
  Apache does not want leads, benevolent dictators or assigned
  maintainers, no matter how you may dress it up with multiple
  maintainers per component. The fact is that this creates an unequal
  level of ownership and responsibility. The Board has shut down
  projects that attempted or allowed for Leads. Just a few months ago,
  there was a problem with somebody calling themself a Lead.
 
  I don't know why you suggest that Apache Subversion does this. We
  absolutely do not. Never have. Never will. The Subversion codebase is
  owned by all of us, and we all care for every line of it. Some people
  know more than others, of course. But any one of us, can change any
  part, without being subjected to a maintainer. Of course, we ask
  people with more knowledge of the component when we feel
  uncomfortable, but we also know when it is safe or not to make a
  specific change. And *always*, our fellow committers can review our
  work and let us know when we've done something wrong.
 
  Equal ownership reduces fiefdoms, enhances a feeling of community and
  project ownership, and creates a more open and inviting project.
 
  So again: -1 on this entire concept. Not good, to be polite.
 
  Regards,
  Greg Stein
  Director, Vice Chairman
  Apache Software Foundation
 
  On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
  Hi all,
 
  I wanted to share a discussion we've been having on the PMC list, as
 well as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed 

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-11-06 Thread Corey Nolet
Michael,

Thanks for the explanation. I was able to get this running.

On Wed, Oct 29, 2014 at 3:07 PM, Michael Armbrust mich...@databricks.com
wrote:

 We are working on more helpful error messages, but in the meantime let me
 explain how to read this output.

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'p.name,'p.age, tree:

 Project ['p.name,'p.age]
  Filter ('location.number = 2300)
   Join Inner, Some((location#110.number AS number#111 = 'ln.streetnumber))
Generate explode(locations#10), true, false, Some(l)
 LowerCaseSchema
  Subquery p
   Subquery people
SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11],
 MappedRDD[28] at map at JsonRDD.scala:38)
LowerCaseSchema
 Subquery ln
  Subquery locationNames
   SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
 MappedRDD[99] at map at JsonRDD.scala:38)

 'Ticked fields indicate a failure to resolve, whereas numbered (#10)
 attributes have been resolved. (The numbers are globally unique and can be
 used to disambiguate where a column is coming from when the names are the
 same.)

 Resolution happens bottom up.  So the first place that there is a problem
 is 'ln.streetnumber, which prevents the rest of the query from resolving.
 If you look at the subquery ln, it is only producing two columns:
 locationName and locationNumber. So streetnumber is not valid.


 On Tue, Oct 28, 2014 at 8:02 PM, Corey Nolet cjno...@gmail.com wrote:

 scala> locations.queryExecution

 warning: there were 1 feature warning(s); re-run with -feature for details

 res28: _4.sqlContext.QueryExecution forSome { val _4:
 org.apache.spark.sql.SchemaRDD } =

 == Parsed Logical Plan ==

 SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
 MappedRDD[99] at map at JsonRDD.scala:38)


 == Analyzed Logical Plan ==

 SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
 MappedRDD[99] at map at JsonRDD.scala:38)


 == Optimized Logical Plan ==

 SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
 MappedRDD[99] at map at JsonRDD.scala:38)


 == Physical Plan ==

 ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at
 JsonRDD.scala:38


 Code Generation: false

 == RDD ==


 scala> people.queryExecution

 warning: there were 1 feature warning(s); re-run with -feature for details

 res29: _5.sqlContext.QueryExecution forSome { val _5:
 org.apache.spark.sql.SchemaRDD } =

 == Parsed Logical Plan ==

 SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28]
 at map at JsonRDD.scala:38)


 == Analyzed Logical Plan ==

 SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28]
 at map at JsonRDD.scala:38)


 == Optimized Logical Plan ==

 SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28]
 at map at JsonRDD.scala:38)


 == Physical Plan ==

 ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at
 JsonRDD.scala:38


 Code Generation: false

 == RDD ==



 Here's when I try executing the join and the lateral view explode() :


 14/10/28 23:05:35 INFO ParseDriver: Parse Completed

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
 Unresolved attributes: 'p.name,'p.age, tree:

 Project ['p.name,'p.age]

  Filter ('location.number = 2300)

   Join Inner, Some((location#110.number AS number#111 = 'ln.streetnumber))

Generate explode(locations#10), true, false, Some(l)

 LowerCaseSchema

  Subquery p

   Subquery people

SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11],
 MappedRDD[28] at map at JsonRDD.scala:38)

LowerCaseSchema

 Subquery ln

  Subquery locationNames

   SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
 MappedRDD[99] at map at JsonRDD.scala:38)


 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

 at
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply

Configuring custom input format

2014-11-05 Thread Corey Nolet
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD.
Creating the new RDD works fine but setting up the configuration file via
the static methods on input formats that require a Hadoop Job object is
proving to be difficult.

Trying to new up my own Job object with the
SparkContext.hadoopConfiguration is throwing the exception on line 283 of
this grepcode:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job

Looking in the SparkContext code, I'm seeing that it's newing up Job
objects just fine using nothing but the configuration. Using
SparkContext.textFile() appears to be working for me. Any ideas? Has anyone
else run into this as well? Is it possible to have a method like
SparkContext.getJob() or something similar?

Thanks.
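
For anyone who hits the same thing, a rough sketch of one way to wire it up
(TextInputFormat and the path stand in for the custom format; targeting
compiled code or spark-submit rather than the shell also sidesteps the
toString() problem):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

// The Job exists only to collect configuration for the input format.
val job = Job.getInstance(sc.hadoopConfiguration)
FileInputFormat.addInputPath(job, new Path("hdfs:///data/in"))
// ... any other static setup calls the input format needs go here ...

val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])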


Re: Configuring custom input format

2014-11-05 Thread Corey Nolet
The closer I look at the stack trace in the Scala shell, the more it appears
to be the call to toString() that is causing the construction of the Job
object to fail. Is there a way to suppress this output, since it appears to
be hindering my ability to new up this object?

On Wed, Nov 5, 2014 at 5:49 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD.
 Creating the new RDD works fine but setting up the configuration file via
 the static methods on input formats that require a Hadoop Job object is
 proving to be difficult.

 Trying to new up my own Job object with the
 SparkContext.hadoopConfiguration is throwing the exception on line 283 of
 this grepcode:


 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job

 Looking in the SparkContext code, I'm seeing that it's newing up Job
 objects just fine using nothing but the configuration. Using
 SparkContext.textFile() appears to be working for me. Any ideas? Has anyone
 else run into this as well? Is it possible to have a method like
 SparkContext.getJob() or something similar?

 Thanks.




Re: Spark SQL takes unexpected time

2014-11-04 Thread Corey Nolet
Michael,

I should probably look more closely myself at the design of 1.2 vs 1.1, but
I've been curious why Spark's in-memory data uses the heap instead of being
put off-heap. Was this the optimization that was done in 1.2 to alleviate GC?

On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari sbir...@wynyardgroup.com
wrote:

 Yes, I am using Spark1.1.0 and have used rdd.registerTempTable().
 I tried by adding sqlContext.cacheTable(), but it took 59 seconds (more
 than
 earlier).

 I also tried by changing schema to use Long data type in some fields but
 seems conversion takes more time.
 Is there any way to specify an index? I checked and didn't find any,
 just want to confirm.

 For your reference here is the snippet of code.


 -
 case class EventDataTbl(EventUID: Long,
 ONum: Long,
 RNum: Long,
 Timestamp: java.sql.Timestamp,
 Duration: String,
 Type: String,
 Source: String,
 OName: String,
 RName: String)

 val format = new java.text.SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
 val cedFileName = "hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2"
 val cedRdd = sc.textFile(cedFileName).map(_.split(",", -1)).map(p =>
   EventDataTbl(p(0).toLong, p(1).toLong, p(2).toLong,
     new java.sql.Timestamp(format.parse(p(3)).getTime()),
     p(4), p(5), p(6), p(7), p(8)))

 cedRdd.registerTempTable("EventDataTbl")
 sqlCntxt.cacheTable("EventDataTbl")

 val t1 = System.nanoTime()
 println("\n\n10 Most frequent conversations between the Originators and Recipients\n")
 sql("SELECT COUNT(*) AS Frequency,ONum,OName,RNum,RName FROM EventDataTbl GROUP BY ONum,OName,RNum,RName ORDER BY Frequency DESC LIMIT 10").collect().foreach(println)
 val t2 = System.nanoTime()
 println("Time taken " + (t2-t1)/1000000000.0 + " Seconds")


 -

 Thanks,
   Shailesh



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-takes-unexpected-time-tp17925p18017.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Why mapred for the HadoopRDD?

2014-11-04 Thread Corey Nolet
I'm fairly new to Spark and I'm trying to kick the tires with a few
InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf
instead of a MapReduce Job object. Is there planned future support for the
mapreduce package?
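
For context, the new-API call I'd like to build toward looks roughly like this
(just a sketch- the path is a placeholder and TextInputFormat is only an example
input format):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// sketch: reading through the new (mapreduce) API rather than a mapred JobConf
val rdd = sc.newAPIHadoopFile(
  "hdfs:///some/hypothetical/path",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  sc.hadoopConfiguration)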


Re: unsubscribe

2014-10-31 Thread Corey Nolet
Hongbin,

Please send an email to user-unsubscr...@spark.apache.org in order to
unsubscribe from the user list.

On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote:

  Apology for having to send to all.



  I am highly interested in Spark and would like to stay on this mailing list,
  but the email I signed up with is not the right one. The link below to
  unsubscribe does not seem to be working.



 https://spark.apache.org/community.html



 Can anyone help?

 --
 This message may contain confidential information and is intended for
 specific recipients unless explicitly noted otherwise. If you have reason
 to believe you are not an intended recipient of this message, please delete
 it and notify the sender. This message may not represent the opinion of
 Intercontinental Exchange, Inc. (ICE), its subsidiaries or affiliates, and
 does not constitute a contract or guarantee. Unencrypted electronic mail is
 not secure and the recipient of this message is expected to provide
 safeguards from viruses and pursue alternate means of communication where
 privacy or a binding message is desired.



Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
So it wouldn't be possible to have a json string like this:

{ "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}

And query all people who have a location with number = 2300?




On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust mich...@databricks.com
wrote:

 On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote:

 Is it possible to select if, say, there was an addresses field that had a
 json array?

 You can get the Nth item by address.getItem(0).  If you want to walk
 through the whole array look at LATERAL VIEW EXPLODE in HiveQL




Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
Michael,

Awesome, this is what I was looking for. So it's possible to use the Hive
dialect in a regular SQL context? This is what was confusing to me- the
docs kind of allude to it but don't directly point it out.
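
For anyone else who hits this, here's roughly what I'm trying now (a sketch,
assuming a HiveContext built on the same SparkContext):

import org.apache.spark.sql.hive.HiveContext

// sketch: HiveContext exposes the HiveQL dialect, including LATERAL VIEW explode()
val hctx = new HiveContext(sc)
val json = """{ "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}"""
hctx.jsonRDD(sc.parallelize(json :: Nil)).registerTempTable("people")
hctx.sql("SELECT name FROM people LATERAL VIEW explode(locations) l AS location " +
  "WHERE location.number = 2300").collect()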

On Tue, Oct 28, 2014 at 9:30 PM, Michael Armbrust mich...@databricks.com
wrote:

 You can do this:

 $ sbt/sbt hive/console

  scala> jsonRDD(sparkContext.parallelize("""{ "name":"John", "age":53,
  "locations": [{ "street":"Rodeo Dr", "number":2300 }]}""" ::
  Nil)).registerTempTable("people")

  scala> sql("SELECT name FROM people LATERAL VIEW explode(locations) l AS
  location WHERE location.number = 2300").collect()
 res0: Array[org.apache.spark.sql.Row] = Array([John])

 This will double show people who have more than one matching address.

 On Tue, Oct 28, 2014 at 5:52 PM, Corey Nolet cjno...@gmail.com wrote:

 So it wouldn't be possible to have a json string like this:

  { "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}

 And query all people who have a location with number = 2300?




 On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust mich...@databricks.com
  wrote:

 On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote:

 Is it possible to select if, say, there was an addresses field that had
 a json array?

 You can get the Nth item by address.getItem(0).  If you want to walk
 through the whole array look at LATERAL VIEW EXPLODE in HiveQL







Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
Am I able to do a join on an exploded field?

Like if I have another object:

{ "streetNumber":2300, "locationName":"The Big Building" } and I want to
join with the previous json by the locations[].number field- is that
possible?
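
Concretely, something like this is what I'm hoping to be able to do (just a
sketch against the hive/console example quoted below- the locationNames table
and its streetNumber field are made-up names):

// sketch: materialize the exploded locations first, then join against them
val exploded = sql("SELECT p.name, p.age, location.number AS number " +
  "FROM people p LATERAL VIEW explode(locations) l AS location")
exploded.registerTempTable("peopleLocations")

sql("SELECT pl.name, pl.age, ln.locationName FROM peopleLocations pl " +
  "JOIN locationNames ln ON pl.number = ln.streetNumber").collect()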


On Tue, Oct 28, 2014 at 9:31 PM, Corey Nolet cjno...@gmail.com wrote:

 Michael,

 Awesome, this is what I was looking for. So it's possible to use hive
 dialect in a regular sql context? This is what was confusing to me- the
 docs kind of allude to it but don't directly point it out.

 On Tue, Oct 28, 2014 at 9:30 PM, Michael Armbrust mich...@databricks.com
 wrote:

 You can do this:

 $ sbt/sbt hive/console

  scala> jsonRDD(sparkContext.parallelize("""{ "name":"John", "age":53,
  "locations": [{ "street":"Rodeo Dr", "number":2300 }]}""" ::
  Nil)).registerTempTable("people")

  scala> sql("SELECT name FROM people LATERAL VIEW explode(locations) l AS
  location WHERE location.number = 2300").collect()
 res0: Array[org.apache.spark.sql.Row] = Array([John])

 This will double show people who have more than one matching address.

 On Tue, Oct 28, 2014 at 5:52 PM, Corey Nolet cjno...@gmail.com wrote:

 So it wouldn't be possible to have a json string like this:

  { "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}

 And query all people who have a location with number = 2300?




 On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust 
 mich...@databricks.com wrote:

 On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote:

 Is it possible to select if, say, there was an addresses field that
 had a json array?

 You can get the Nth item by address.getItem(0).  If you want to walk
 through the whole array look at LATERAL VIEW EXPLODE in HiveQL








Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
scala> locations.queryExecution

warning: there were 1 feature warning(s); re-run with -feature for details

res28: _4.sqlContext.QueryExecution forSome { val _4:
org.apache.spark.sql.SchemaRDD } =

== Parsed Logical Plan ==

SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
MappedRDD[99] at map at JsonRDD.scala:38)


== Analyzed Logical Plan ==

SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
MappedRDD[99] at map at JsonRDD.scala:38)


== Optimized Logical Plan ==

SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
MappedRDD[99] at map at JsonRDD.scala:38)


== Physical Plan ==

ExistingRdd [locationName#80,locationNumber#81], MappedRDD[99] at map at
JsonRDD.scala:38


Code Generation: false

== RDD ==


scala> people.queryExecution

warning: there were 1 feature warning(s); re-run with -feature for details

res29: _5.sqlContext.QueryExecution forSome { val _5:
org.apache.spark.sql.SchemaRDD } =

== Parsed Logical Plan ==

SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28]
at map at JsonRDD.scala:38)


== Analyzed Logical Plan ==

SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28]
at map at JsonRDD.scala:38)


== Optimized Logical Plan ==

SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11], MappedRDD[28]
at map at JsonRDD.scala:38)


== Physical Plan ==

ExistingRdd [age#9,locations#10,name#11], MappedRDD[28] at map at
JsonRDD.scala:38


Code Generation: false

== RDD ==



Here's when I try executing the join and the lateral view explode() :


14/10/28 23:05:35 INFO ParseDriver: Parse Completed

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'p.name,'p.age, tree:

Project ['p.name,'p.age]

 Filter ('location.number = 2300)

  Join Inner, Some((location#110.number AS number#111 = 'ln.streetnumber))

   Generate explode(locations#10), true, false, Some(l)

LowerCaseSchema

 Subquery p

  Subquery people

   SparkLogicalPlan (ExistingRdd [age#9,locations#10,name#11],
MappedRDD[28] at map at JsonRDD.scala:38)

   LowerCaseSchema

Subquery ln

 Subquery locationNames

  SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81],
MappedRDD[99] at map at JsonRDD.scala:38)


at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397)

at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:358)

at
org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:357)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)

 at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)

On Tue, Oct 28, 2014 at 10:48 PM, Michael Armbrust mich...@databricks.com
wrote:

 Can you println the .queryExecution of the SchemaRDD?

 On Tue, Oct 28, 2014 at 7:43 PM, Corey Nolet cjno...@gmail.com wrote:

 So this appears to work just fine:

  hctx.sql("SELECT p.name, p.age FROM people p LATERAL VIEW
  explode(locations) l AS location JOIN location5 lo ON l.number =
  lo.streetNumber WHERE location.number = '2300'").collect()

 But as soon as I try to join with another set based on a property from
 the exploded locations set, I get invalid attribute exceptions:

 hctx.sql(SELECT p.name, p.age, ln.locationName  FROM people as p
 LATERAL VIEW explode(locations) l AS location JOIN locationNames ln ON
 location.number = ln.streetNumber WHERE location.number

Re: Accumulo version at runtime?

2014-10-24 Thread Corey Nolet
Dylan,

I know your original post mentioned grabbing it through the client API but
there's not currently a way to do that. As Sean mentioned, you can do it if
you have access to the cluster. You can run the reflection Keith provided
by adding the files in $ACCUMULO_HOME/lib/ to your classpath and running
your code on the cluster. I definitely think exposing the server version
should make it into 1.7.
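
For reference, a rough sketch of what that looks like (untested, and assuming the
Constants class is on the classpath; the fallback mirrors Keith's suggestion of
handling the not-found cases gracefully):

// sketch: read the VERSION constant reflectively, falling back if it has moved
val accumuloVersion =
  try {
    Class.forName("org.apache.accumulo.core.Constants")
      .getDeclaredField("VERSION")
      .get(null)
      .toString
  } catch {
    case _: ClassNotFoundException | _: NoSuchFieldException => "unknown"
  }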

On Fri, Oct 24, 2014 at 6:00 PM, Dylan Hutchison dhutc...@stevens.edu
wrote:

 Keith, I'm confused as to how you would run reflection

  Constants.class.getDeclaredField("VERSION").get(String.class)

 on the Accumulo server.  Wouldn't we need to compile in a function
 returning that into the server for a custom Accumulo server build?

 Say we have a 1.6 Accumulo instance live.  A client needs to know the
 version in order to load the appropriate class.  How would you execute that
 code on the server if it is built from the standard Accumulo 1.6 binaries?

 On Fri, Oct 24, 2014 at 10:22 AM, Keith Turner ke...@deenlo.com wrote:

 Below is something I wrote for 1.4 that would grab the version using
  reflection.  If the constant has moved in later versions, you could gracefully
 handle class not found and field not found exceptions and look in other
 places.


 https://github.com/keith-turner/instamo/blob/master/src/main/java/instamo/MiniAccumuloCluster.java#L357

 On Fri, Oct 24, 2014 at 1:38 AM, Dylan Hutchison dhutc...@stevens.edu
 wrote:

 How about a compromise: create *two classes *for the two versions, both
 implementing the same interface.  Instantiate the class for the correct
 version either from (1) static configuration file or (2) runtime hack
 lookup to the version on the Monitor.

 (1) gives safety at the expense of the user having to specify another
 parameter.  (2) looks like it will work at least in the near future going
 to 1.7, as well as for past versions.

 Thanks for the suggestions!  I like the two classes approach better both
 as a developer and as a user; no need to juggle JARs.

 ~Dylan


 On Fri, Oct 24, 2014 at 12:41 AM, Sean Busbey bus...@cloudera.com
 wrote:


 On Thu, Oct 23, 2014 at 10:38 PM, Dylan Hutchison dhutc...@stevens.edu
  wrote:


 I'm working on a clean way to handle getting Accumulo monitor info
 for different versions of Accumulo, since I used methods to extract that
 information from Accumulo's internals which are version-dependent. As Sean
 wrote, these are not things one should do, but if it's a choice between
 getting the info or not...
 We're thinking of building separate JARs for each 1.x version.


 Why not just take the version of Accumulo you're going to talk to as
 configuration information that's given to you as a part of deploying your
 software?

 It'll make your life much simpler in the long run.


 --
 Sean




 --
 www.cs.stevens.edu/~dhutchis





 --
 www.cs.stevens.edu/~dhutchis



Re: Raise Java dependency from 6 to 7

2014-10-19 Thread Corey Nolet
A concrete plan and a definite version upon which the upgrade would be
applied sounds like it would benefit the community. If you plan far enough
out (as Hadoop has done) and give the community enough notice, I can't
see it being a problem, as they would have ample time to upgrade.



On Sat, Oct 18, 2014 at 9:20 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Hadoop, for better or worse, depends on an ancient version of Jetty
 (6), that is even on a different package. So Spark (or anyone trying
 to use a newer Jetty) is lucky on that front...

 IIRC Hadoop is planning to move to Java 7-only starting with 2.7. Java
 7 is also supposed to be EOL some time next year, so a plan to move to
 Java 7 and, eventually, Java 8 would be nice.

 On Sat, Oct 18, 2014 at 5:44 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  I'd also wait a bit until these are gone. Jetty is unfortunately a much
 hairier topic by the way, because the Hadoop libraries also depend on
 Jetty. I think it will be hard to update. However, a patch that shades
 Jetty might be nice to have, if that doesn't require shading a lot of other
 stuff.
 
  Matei
 
  On Oct 18, 2014, at 4:37 PM, Koert Kuipers ko...@tresata.com wrote:
 
  my experience is that there are still a lot of java 6 clusters out
 there.
  also distros that bundle spark still support java 6
  On Oct 17, 2014 8:01 PM, Andrew Ash and...@andrewash.com wrote:
 
  Hi Spark devs,
 
  I've heard a few times that keeping support for Java 6 is a priority
 for
  Apache Spark.  Given that Java 6 has been publicly EOL'd since Feb 2013
  http://www.oracle.com/technetwork/java/eol-135779.html and the last
  public update was Apr 2013
  https://en.wikipedia.org/wiki/Java_version_history#Java_6_updates,
 why
  are we still maintaining support for 6?  The only people using it now
 must be
  paying for the extended support to continue receiving security fixes.
 
  Bumping the lower bound of Java versions up to Java 7 would allow us to
  upgrade from Jetty 8 to 9, which is currently a conflict with the
  Dropwizard framework and a personal pain point.
 
  Java 6 vs 7 for Spark links:
  Try with resources
  https://github.com/apache/spark/pull/2575/files#r18152125 for
  SparkContext et al
  Upgrade to Jetty 9
  https://github.com/apache/spark/pull/167#issuecomment-54544494
  Warn when not compiling with Java6
  https://github.com/apache/spark/pull/859
 
 
  Who are the people out there that still need Java 6 support?
 
  Thanks!
  Andrew
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: How can I use a time window with the trident api?

2014-10-10 Thread Corey Nolet
I started a project to do sliding and tumbling windows in Storm. It could
be used directly or as an example.

http://github.com/calrissian/flowmix
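
If it helps as a starting point, the basic tumbling-window idea looks roughly like
this with a plain bolt and tick tuples (a sketch only- this is not Flowmix's actual
API, and the class and field names are made up):

import java.util.{HashMap => JHashMap, Map => JMap}
import scala.collection.mutable.ListBuffer
import backtype.storm.{Config, Constants}
import backtype.storm.task.{OutputCollector, TopologyContext}
import backtype.storm.topology.OutputFieldsDeclarer
import backtype.storm.topology.base.BaseRichBolt
import backtype.storm.tuple.{Fields, Tuple, Values}

// sketch of a tumbling window: buffer tuples, flush the batch on every tick tuple
class TumblingWindowBolt(windowSeconds: Int) extends BaseRichBolt {
  private var collector: OutputCollector = _
  private val buffer = ListBuffer[Tuple]()

  override def prepare(conf: JMap[_, _], ctx: TopologyContext, out: OutputCollector): Unit =
    collector = out

  override def execute(tuple: Tuple): Unit = {
    val isTick = tuple.getSourceComponent == Constants.SYSTEM_COMPONENT_ID &&
      tuple.getSourceStreamId == Constants.SYSTEM_TICK_STREAM_ID
    if (isTick) {
      collector.emit(new Values(buffer.toList))   // emit the window that just closed
      buffer.foreach(collector.ack)
      buffer.clear()
      collector.ack(tuple)
    } else {
      buffer += tuple
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("window"))

  // ask Storm to deliver a tick tuple to this bolt every windowSeconds
  override def getComponentConfiguration: JMap[String, AnyRef] = {
    val conf = new JHashMap[String, AnyRef]()
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, Int.box(windowSeconds))
    conf
  }
}

Flowmix layers sliding windows, triggers, and partitioning on top of ideas like this.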
On Oct 9, 2014 11:54 PM, 姚驰 yaoch...@163.com wrote:

 Hello, I'm trying to use Storm to manipulate our monitoring data, but I
 don't know how to add time window support under the Trident API.
 Can anyone help, thanks a lot.





[ANNOUNCE] Fluo 1.0.0-alpha-1 Released

2014-10-09 Thread Corey Nolet
The Fluo project is happy to announce the 1.0.0-alpha-1 release of Fluo.

Fluo is a transaction layer that enables incremental processing on top of
Accumulo. It integrates into Yarn using Apache Twill.

This is the first release of Fluo and is not ready for production use. We
invite developers to try it out, play with the quickstart & examples, and
contribute back in the form of bug reports, new feature requests, and pull
requests.

For more information, visit http://www.fluo.io.


Re: C++ accumulo client -- native clients for Python, Go, Ruby etc

2014-10-06 Thread Corey Nolet
I'm all for this- though I'm curious to know the thoughts about maintenance
and the design. Are we going to use thrift to tie the C++ client calls into
the server-side components? Is that going to be maintained through a
separate effort or is the plan to  have the Accumulo community officially
support it?

On Mon, Oct 6, 2014 at 2:34 PM, Josh Elser josh.el...@gmail.com wrote:

 It'd be really cool to see a C++ client -- fully implemented or not. The
 increased performance via other languages like you said would be really
 nice, but I'd also be curious to see how the server characteristics change
 when the client might be sending data at a much faster rate.

 My C++ is super rusty these days, but I'd be happy to help out any devs
 who can spearhead the effort :)


 John R. Frank wrote:

 Accumulo Developers,

 We're trying to boost throughput of non-Java tools with Accumulo.  It
 seems that the lowest hanging fruit is to stop using the thrift proxy. Per
 discussion about Python and thrift proxy in the users list [1], I'm
 wondering if anyone is interested in helping with a native C++ client?
 There is a start on one here [2]. We could offer a bounty or maybe make a
 consulting project depending who is interested in it.

 We also looked at trying to run a separate thrift proxy for every worker
 thread or process.  With many cores on a box, eg 32, it just doesn't seem
 practical to run that many proxies, even if they all run on a single JVM.
 We'd be glad to hear ideas on that front too.

 A potentially big benefit of making a proper C++ accumulo client is that
 it is straightforward to expose native interfaces in Python (via pyObject),
 Go [3], Ruby [4], and other languages.

 Thanks for any advice, pointers, interest.

 John


 1-- http://www.mail-archive.com/user@accumulo.apache.org/msg03999.html

 2--
 https://github.com/phrocker/apeirogon

 3-- http://golang.org/cmd/cgo/

 4-- https://www.amberbit.com/blog/2014/6/12/calling-c-cpp-from-ruby/


 Sent from +1-617-899-2066




[ANNOUNCE] Apache 1.6.1 Released

2014-10-03 Thread Corey Nolet
The Apache Accumulo project is happy to announce its 1.6.1 release.

Version 1.6.1 is the most recent bug-fix release in its 1.6.x release line.
This version includes numerous bug fixes and performance improvements over
previous versions. Existing users of 1.6.x are encouraged to upgrade to
this version. Users new to Accumulo are encouraged to start with this
version as well.

The Apache Accumulo sorted, distributed key/value store is a robust,
scalable, high performance data storage system that features cell-based
access control and customizable server-side processing.  It is based on
Google's BigTable design and is built on top of Apache Hadoop, Apache
Zookeeper, and Apache Thrift.

The release is available at http://accumulo.apache.org/downloads/ and
release notes at http://accumulo.apache.org/release_notes/1.6.1.html.


Thanks.

- The Apache Accumulo Team


Re: Accumulo Powered By Logo

2014-10-02 Thread Corey Nolet
I think a logo that's more friendly to place in a circle would be useful.
The Accumulo logo is very squared off.

On Thu, Oct 2, 2014 at 3:39 PM, Mike Drob mad...@cloudera.com wrote:

 Yea, as an outside observer, I would have no idea what Apache A is, nor
 any idea how to get more information. Maybe we just need a different logo,
 altogether, given the context of putting it in the PBA circle.

 On Thu, Oct 2, 2014 at 2:36 PM, Keith Turner ke...@deenlo.com wrote:

  I was looking at the new Accumulo powered logo [1] and thought that
 just
  an A[2] may be better.  Any other thoughts on how to improve this?
 
  Someone mentioned that just the A[2] isn't as informative in the case
 where
  someone is completely unfamiliar w/ Accumulo.
 
  [1]: http://apache.org/foundation/press/kit/poweredBy/pb-accumulo.jpg
  [2]: http://people.apache.org/~kturner/pb-accumulo.png
 



Re: [VOTE] Apache Accumulo 1.6.1 RC1

2014-09-25 Thread Corey Nolet
I'm seeing the behavior under Mac OS X and Fedora 19 and they have been
consistently failing for me. I'm thinking ACCUMULO-3073. Since others are
able to get it to pass, I did not think it should fail the vote solely on
that but I do think it needs attention, quickly.

On Thu, Sep 25, 2014 at 10:43 AM, Bill Havanki bhava...@clouderagovt.com
wrote:

 I haven't had an opportunity to try it again since my +1, but prior to that
 it has been consistently failing.

 - I tried extending the timeout on the test, but it would still time out.
 - I see the behavior on Mac OS X and under CentOS. (I wonder if it's a JVM
 thing?)

 On Wed, Sep 24, 2014 at 9:06 PM, Corey Nolet cjno...@gmail.com wrote:

  Vote passes with 4 +1's and no -1's.
 
  Bill, were you able to get the IT to run yet? I'm still having timeouts
 on
  my end as well.
 
 
  On Wed, Sep 24, 2014 at 1:41 PM, Josh Elser josh.el...@gmail.com
 wrote:
 
    The crux of it is that both of the errors in the CRC were single bit
   variants.
  
    'y' instead of '9' and 'p' instead of '0'
  
   Both of these cases are a '1' in the most significant bit of the byte
   instead of a '0'. We recognized these because y and p are outside of
 the
   hex range. Fixing both of these fixes the CRC error (manually
 verified).
  
   That's all we know right now. I'm currently running memtest86. I do not
   have ECC ram, so it *is* theoretically possible that was the cause.
 After
   running memtest for a day or so (or until I need my desktop functional
   again), I'll go back and see if I can reproduce this again.
  
  
   Mike Drob wrote:
  
    Any chance the IRC chats can make it onto the ML for posterity?
  
   Mike
  
   On Wed, Sep 24, 2014 at 12:04 PM, Keith Turnerke...@deenlo.com
  wrote:
  
On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks
 rwe...@newbrightidea.com
   wrote:
  
    Interesting that 'y' (0x79) and '9' (0x39) are one bit away from
  each
   other. I blame cosmic rays!
  
   It is interesting, and that's only half of the story.  It's been
   interesting
   chatting w/ Josh about this on irc and hearing about his findings.
  
  
On Wed, Sep 24, 2014 at 9:05 AM, Josh Elserjosh.el...@gmail.com
  
   wrote:
  
   The offending keys are:
  
   389a85668b6ebf8e 2ff6:4a78 [] 1411499115242
  
   3a10885b-d481-4d00-be00-0477e231ey65:8576b169:
   0cd98965c9ccc1d0:ba15529e
  
The careful eye will notice that the UUID in the first component
  of
   the
   value has a different suffix than the next corrupt key/value (ends
  with
   ey65 instead of e965). Fixing this in the Value and re-running
  the
  
   CRC
  
   makes it pass.
  
  
 and
  
   7e56b58a0c7df128 5fa0:6249 [] 1411499311578
  
   3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb:
   499fa72752d82a7c:5c5f19e8
  
  
  
  
 



 --
 // Bill Havanki
 // Solutions Architect, Cloudera Govt Solutions
 // 443.686.9283



Re: [accumulo] your /dist/ artifacts - 1 BAD signature

2014-09-25 Thread Corey Nolet
I see what happened. I was expecting the mvn:release plugin to push the
"prepare for next development iteration" commit, which it did not. I just pushed it
up and created the tag. I'll work on the release notes in a bit.

On Thu, Sep 25, 2014 at 3:33 PM, Christopher ctubb...@apache.org wrote:

 [note: thread moved to dev@]

 Okay, I just confirmed that the current files in dist are the same ones in
 Maven Central are the same ones that we voted on. So, that issue is
 resolved. I double checked and saw that the gpg-signed tag hasn't been
 created for 1.6.1 (git tag -s 1.6.1 origin/1.6.1-rc1). I guess technically
 anybody could do this, and merge it (along with the version bump to
 1.6.2-SNAPSHOT commit) to 1.6.2-SNAPSHOT branch (and forward, with -sours),
 if Corey doesn't have time/gets busy.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Thu, Sep 25, 2014 at 2:21 PM, Corey Nolet cjno...@gmail.com wrote:

  There's still a few things I need to do before announcing the release to
  the user list. Merging the rc into the next version branch was one of
 them
  and creating the official release tag was another. I'll do these tonight
 as
  well as writing up the release notes for the site.
 
 
  On Thu, Sep 25, 2014 at 1:59 PM, Christopher ctubb...@apache.org
 wrote:
 
   Also, we can move this list to dev@. There's no reason for it to be
   private@
   .
  
  
   --
   Christopher L Tubbs II
   http://gravatar.com/ctubbsii
  
   On Thu, Sep 25, 2014 at 1:59 PM, Christopher ctubb...@apache.org
  wrote:
  
There's one more problem that Keith and I found... it doesn't look
 like
the rc1 branch got merged to 1.6.2-SNAPSHOT. I don't know if some
 other
branch got accidentally merged instead.
   
   
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
   
On Thu, Sep 25, 2014 at 1:40 PM, Josh Elser josh.el...@gmail.com
   wrote:
   
Things look good to me now. I checked the artifacts on dist/ against
   what
I have from evaluating the RC and they appear to match.
   
Anything else we need to do here?
   
   
Christopher wrote:
   
I was able to confirm the signature is bad. When I checked the RC,
  the
signature was good, so I'm guessing the wrong one just got
 uploaded.
  I
don't have a copy of the RC that I had previously downloaded, but I
  was
able to grab a copy of what was deployed to Maven central and fix
 the
dist
sigs/checksums from that.
   
Now, it's possible that the wrong artifacts were uploaded to Maven
central
(perhaps the wrong staging repo was promoted?) I can't know that
 for
sure,
until I can get to work and check my last download from the RC vote
  and
compare with what's in Maven central now. If that is the case, then
  we
need
to determine precisely what is different from this upload and what
  was
voted on and see if we need to immediately re-release as 1.6.2 to
 fix
   the
problems.
   
   
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
   
On Thu, Sep 25, 2014 at 3:12 AM, Henk Penninghe...@apache.org
   wrote:
   
 Hi PMC accumulo,
   
   I watch 'www.apache.org/dist/', and I noticed that :
   
   -- you have 1 BAD pgp signature
   
accumulo/1.6.1/accumulo-1.6.1-src.tar.gz.asc
   
   Please fix this problem soon ; for details, see
   
   
   http://people.apache.org/~henkp/checker/sig.html#project-accumulo
 http://people.apache.org/~henkp/checker/md5.html
   
   For information on how to fix problems, see the faq :
   
 http://people.apache.org/~henkp/checker/faq.html
   
   Thanks a lot, regards,
   
   Henk Penning -- apache.org infrastructure
   
   PS. The contents of this message is generated,
   but the mail itself is sent by hand.
   PS. Please cc me on all relevant emails.
   
-   _
Henk P. Penning, ICT-beta  R Uithof WISK-412  _/ _
Faculty of Science, Utrecht University T +31 30 253 4106 / _/
Budapestlaan 6, 3584CD Utrecht, NL F +31 30 253 4553 _/ _/
http://people.cs.uu.nl/henkp/  M penn...@uu.nl _/
   
   
   
   
  
 



Re: [VOTE] Apache Accumulo 1.6.1 RC1

2014-09-25 Thread Corey Nolet
Christopher, are you referring to Keith's last comment or Bill Slacum's?

On Thu, Sep 25, 2014 at 9:13 PM, Christopher ctubb...@apache.org wrote:

 That seems like a reason to vote -1 (and perhaps to encourage others to do
 so also). I'm not sure this can be helped so long as people have different
 criteria for their vote, though. If we can fix those issues, I'm ready to
 vote on a 1.6.2 :)


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Thu, Sep 25, 2014 at 2:42 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

  I'm a little concerned we had two +1's that mention failures. The one
 time
  when we're supposed to have a clean run through, we have 50% of the
   participants noticing failure. It doesn't instill much confidence in me.
 
  On Thu, Sep 25, 2014 at 2:18 PM, Josh Elser josh.el...@gmail.com
 wrote:
 
   Please make a ticket for it and supply the MAC directories for the test
   and the failsafe output.
  
   It doesn't fail for me. It's possible that there is some edge case that
   you and Bill are hitting that I'm not.
  
  
   Corey Nolet wrote:
  
    I'm seeing the behavior under Mac OS X and Fedora 19 and they have
 been
   consistently failing for me. I'm thinking ACCUMULO-3073. Since others
  are
   able to get it to pass, I did not think it should fail the vote solely
  on
   that but I do think it needs attention, quickly.
  
   On Thu, Sep 25, 2014 at 10:43 AM, Bill Havanki
  bhava...@clouderagovt.com
   wrote:
  
I haven't had an opportunity to try it again since my +1, but prior
 to
   that
   it has been consistently failing.
  
   - I tried extending the timeout on the test, but it would still time
  out.
   - I see the behavior on Mac OS X and under CentOS. (I wonder if it's
 a
   JVM
   thing?)
  
   On Wed, Sep 24, 2014 at 9:06 PM, Corey Noletcjno...@gmail.com
  wrote:
  
Vote passes with 4 +1's and no -1's.
  
   Bill, were you able to get the IT to run yet? I'm still having
  timeouts
  
   on
  
   my end as well.
  
  
   On Wed, Sep 24, 2014 at 1:41 PM, Josh Elserjosh.el...@gmail.com
  
   wrote:
  
    The crux of it is that both of the errors in the CRC were single
 bit
   variants.
  
    'y' instead of '9' and 'p' instead of '0'
  
   Both of these cases are a '1' in the most significant bit of the
 byte
   instead of a '0'. We recognized these because y and p are outside
 of
  
   the
  
   hex range. Fixing both of these fixes the CRC error (manually
  
   verified).
  
   That's all we know right now. I'm currently running memtest86. I do
  not
   have ECC ram, so it *is* theoretically possible that was the cause.
  
   After
  
   running memtest for a day or so (or until I need my desktop
 functional
   again), I'll go back and see if I can reproduce this again.
  
  
   Mike Drob wrote:
  
    Any chance the IRC chats can make it onto the ML for posterity?
  
   Mike
  
   On Wed, Sep 24, 2014 at 12:04 PM, Keith Turnerke...@deenlo.com
  
   wrote:
  
 On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks
  
   rwe...@newbrightidea.com
  
   wrote:
  
  Interesting that 'y' (0x79) and '9' (0x39) are one bit away
  from
  
   each
  
   other. I blame cosmic rays!
  
  It is interesting, and that's only half of the story.  It's been
  
   interesting
   chatting w/ Josh about this on irc and hearing about his
 findings.
  
  
 On Wed, Sep 24, 2014 at 9:05 AM, Josh Elser
 josh.el...@gmail.com
  
   wrote:
  
The offending keys are:
  
   389a85668b6ebf8e 2ff6:4a78 [] 1411499115242
  
   3a10885b-d481-4d00-be00-0477e231ey65:8576b169:
   0cd98965c9ccc1d0:ba15529e
  
 The careful eye will notice that the UUID in the first
   component
  
   of
  
   the
   value has a different suffix than the next corrupt key/value
  (ends
  
   with
  
   ey65 instead of e965). Fixing this in the Value and re-running
  
   the
  
   CRC
  
makes it pass.
  
  
  and
  
7e56b58a0c7df128 5fa0:6249 [] 1411499311578
  
   3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb:
   499fa72752d82a7c:5c5f19e8
  
  
  
  
  
   --
   // Bill Havanki
   // Solutions Architect, Cloudera Govt Solutions
   // 443.686.9283
  
  
  
 



Re: [VOTE] Apache Accumulo 1.6.1 RC1

2014-09-24 Thread Corey Nolet
Bill,

I've been having that same IT issue and said the same thing: "It's not
happening to others." I lifted the timeout completely and it never finished.


On Wed, Sep 24, 2014 at 1:13 PM, Mike Drob mad...@cloudera.com wrote:

 Any chance the IRC chats can make it onto the ML for posterity?

 Mike

 On Wed, Sep 24, 2014 at 12:04 PM, Keith Turner ke...@deenlo.com wrote:

  On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks rwe...@newbrightidea.com
  wrote:
 
    Interesting that 'y' (0x79) and '9' (0x39) are one bit away from each
   other. I blame cosmic rays!
  
 
   It is interesting, and that's only half of the story.  It's been
 interesting
  chatting w/ Josh about this on irc and hearing about his findings.
 
 
  
   On Wed, Sep 24, 2014 at 9:05 AM, Josh Elser josh.el...@gmail.com
  wrote:
  
   
The offending keys are:
   
389a85668b6ebf8e 2ff6:4a78 [] 1411499115242
   
3a10885b-d481-4d00-be00-0477e231ey65:8576b169:
0cd98965c9ccc1d0:ba15529e
   
   
The careful eye will notice that the UUID in the first component of
 the
value has a different suffix than the next corrupt key/value (ends
 with
ey65 instead of e965). Fixing this in the Value and re-running
 the
   CRC
makes it pass.
   
   
 and
   
7e56b58a0c7df128 5fa0:6249 [] 1411499311578
   
3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb:
499fa72752d82a7c:5c5f19e8
   
   
  
 



Re: [DISCUSS] Thinking about branch names

2014-09-23 Thread Corey Nolet
+1

Using separate branches in this manner just adds complexity. I was
wondering myself why we needed to create separate branches when all we're
doing is tagging/deleting the already released ones. The only difference
between where one leaves off and another begins  is the name of the branch.


On Tue, Sep 23, 2014 at 9:04 AM, Christopher ctubb...@apache.org wrote:

 +1 to static dev branch names per release series. (this would also fix the
 Jenkins spam when the builds break due to branch name changes)

 However, I kind of prefer 1.5.x or 1.5-dev, or similar, over simply 1.5,
 which looks so much like a release version that I wouldn't want it to
 generate any confusion.

 Also, for reference, here's a few git commands that might help some people
 avoid the situation that happened:
 git remote update
 git remote prune $(git remote)
  git config --global push.default current # git < 1.8
  git config --global push.default simple # git >= 1.8

 The situation seems to primarily have occurred because of some pushes that
 succeeded because the local clone was not aware that the remote branches
 had disappeared. Pruning will clean those up, so that you'll get an error
 if you try to push. Simple/current push strategy will ensure you don't push
 all matching branches by default. Josh's proposed solution makes it less
 likely the branches will disappear/change on a remote, but these are still
 useful git commands to be aware of, and are related enough to this
 situation, I thought I'd share.



 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Mon, Sep 22, 2014 at 11:18 PM, Josh Elser josh.el...@gmail.com wrote:

  After working on 1.5.2 and today's branch snafu, I think I've come to the
  conclusion that our branch naming is more pain than it's worth (I
 believe I
  was the one who primarily argued for branch names as they are current
  implemented, so take that as you want).
 
  * Trying to making a new branch for the next version as a release is
  happening forces you to fight with Maven. Maven expects that your next
 is
  going to be on the same branch and the way it makes commits and bumps
  versions for you encourages this. Using a new branch for next is more
  manual work for the release manager.
 
  * The time after we make a release, there's a bit of confusion (I do it
  too, just not publicly... yet) about what branch do I put this fix for
  _version_ in?. It's not uncommon to put it in the old branch instead
 of
  the new one. The problem arises when the old branch has already been
  deleted. If a developer has an old version of that branch, there's
 nothing
  to tell them hey, your copy of this branch is behind the remote's copy
 of
  this branch. I'm not accepting your push! Having a single branch for a
  release line removes this hassle.
 
  Pictorially, I'm thinking we would change from the active branches
  {1.5.3-SNAPSHOT, 1.6.1-SNAPSHOT, 1.6.2-SNAPSHOT, master} to {1.5, 1.6,
  master}. (where a git tag would exist for the 1.6.1 RCs).
 
  IIRC, the big argument for per-release branches was of encouraging
  frequent, targeted branches (I know the changes for this version go in
 this
  branch). I think most of this can be mitigated by keeping up with
 frequent
  releases and coordination with the individual cutting the release.
 
  In short, I'm of the opinion that I think we should drop the
 .z-SNAPSHOT
  suffix from branch names (e.g. 1.5.3-SNAPSHOT) and move to a shorter
 x.y
  (e.g. 1.5) that exists for the lifetime of that version. I think we could
  also use this approach if/when we change our versioning to start using
 the
  x component of x.y.z.
 
  Thoughts?
 
  - Josh
 



Re: Apache Storm Graduation to a TLP

2014-09-22 Thread Corey Nolet
Congrats!

On Mon, Sep 22, 2014 at 5:16 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 I’m pleased to announce that Apache Storm has graduated to a Top-Level
 Project (TLP), and I’d like to thank everyone in the Storm community for
 your contributions and help in achieving this important milestone.

 As part of the graduation process, a number of infrastructure changes have
 taken place:

 *New website url:* http://storm.apache.org

 *New git repo urls:*

 https://git-wip-us.apache.org/repos/asf/storm.git (for committer push)

 g...@github.com:apache/storm.git
 -or-
 https://github.com/apache/storm.git (for github pull requests)

 *Mailing Lists:*
 If you are already subscribed, your subscription has been migrated. New
 messages should be sent to the new address:

 [list]@storm.apache.org

 This includes any subscribe/unsubscribe requests.

 Note: The mail-archives.apache.org site will not reflect these changes
 until October 1.


 Most of these changes have already occurred and are seamless. Please
 update your git remotes and address books accordingly.

 - Taylor



Re: [VOTE] Apache Accumulo 1.5.2 RC1

2014-09-18 Thread Corey Nolet
If we are concerned with confusion about adoption of new versions, we
should make a point to articulate the purpose very clearly in each of the
announcements. I was in the combined camp an hour ago and now I'm also
thinking we should keep them separate.


On Fri, Sep 19, 2014 at 1:16 AM, Josh Elser josh.el...@gmail.com wrote:

 No we did not bundle any release announcements prior. I also have to agree
 with Bill -- I don't really see how there would be confusion with a
 properly worded announcement.

 Happy to work with anyone who has concerns in this regard to come up with
 something that is agreeable. I do think they should be separate.


 On 9/19/14, 1:02 AM, Mike Drob wrote:

 Did we bundle 1.5.1/1.6.0? If not, they were fairly close together, I
 think. Historically, we have not done a great job of distinguishing our
 release lines, so that has led to confusion. Maybe I'm on the path to
 talking myself out of a combined announcement here.

 On Thu, Sep 18, 2014 at 9:57 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

  Not to be a total jerk, but what's unclear about 1.5 & 1.6? Lots of
 projects have multiple release lines and it's not an issue.

 On Fri, Sep 19, 2014 at 12:18 AM, Mike Drob mad...@cloudera.com wrote:

  +1 to combining. I've already had questions about upgrading to this

 latest

 release from somebody currently on the 1.6 line. Our release narrative

 is

 not clear and we should not muddle the waters.

 On Thu, Sep 18, 2014 at 7:27 PM, Christopher ctubb...@apache.org

 wrote:


  Should we wait to do a release announcement until 1.6.1, so we can

 batch

 the two?

 My main concern here is that I don't want to encourage new 1.5.x

 adoption

 when we have 1.6.x, and having two announcements could be confusing to

 new

 users who aren't sure which version to start using. We could issue an
 announcement that primarily mentions 1.6.1, and also mentions 1.5.2

 second.

 That way, people will see 1.6.x as the stable/focus release, but will

 still

 inform 1.5.x users of updates.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Thu, Sep 18, 2014 at 10:20 PM, Josh Elser josh.el...@gmail.com

 wrote:


  Vote passes with 3 +1's and nothing else. Huge thank you to those who

 made

 the time to participate.

 I'll finish up the rest of the release work tonight.

 On 9/15/14, 12:24 PM, Josh Elser wrote:

  Devs,

 Please consider the following candidate for Apache Accumulo 1.5.2

 Tag: 1.5.2rc1
 SHA1: 039a2c28bdd474805f34ee33f138b009edda6c4c
 Staging Repository:
 https://repository.apache.org/content/repositories/
 orgapacheaccumulo-1014/

 Source tarball:
 http://repository.apache.org/content/repositories/
 orgapacheaccumulo-1014/org/apache/accumulo/accumulo/1.5.
 2/accumulo-1.5.2-src.tar.gz

 Binary tarball:
 http://repository.apache.org/content/repositories/
 orgapacheaccumulo-1014/org/apache/accumulo/accumulo/1.5.
 2/accumulo-1.5.2-bin.tar.gz

 (Append .sha1, .md5 or .asc to download the signature/hash

 for a

 given artifact.)

 Signing keys available at:

 https://www.apache.org/dist/accumulo/KEYS


 Over 1.5.1, we have 109 issues resolved
 https://git-wip-us.apache.org/repos/asf?p=accumulo.git;a=
 blob;f=CHANGES;h=c2892d6e9b1c6c9b96b2a58fc901a76363ece8b0;hb=
 039a2c28bdd474805f34ee33f138b009edda6c4c


 Testing: all unit and functional tests are passing and ingested 1B
 entries using CI w/ agitation over rc0.

 Vote will be open until Friday, August 19th 12:00AM UTC (8/18 8:00PM

 ET,

 8/18 5:00PM PT)

 - Josh









Re: AccumuloMultiTableInputFormat IllegalStateException

2014-09-17 Thread Corey Nolet
Have you looked at -libjars? The uber jar is one approach, but it can
become very ugly very quickly.

http://grepalex.com/2013/02/25/hadoop-libjars/
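
Roughly, the invocation looks something like this (a sketch- the jar, driver class,
and paths are placeholders, and the driver has to go through ToolRunner /
GenericOptionsParser for -libjars to be honored):

hadoop jar my-ingest-job.jar com.example.MyDriver \
  -libjars /path/to/accumulo-core.jar,/path/to/accumulo-fate.jar,/path/to/accumulo-trace.jar \
  <other args>

The jars get shipped through the distributed cache and onto the task classpath, so
you shouldn't have to copy them into $HADOOP_COMMON_HOME/share/hadoop/common on
every node.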

On Wed, Sep 17, 2014 at 10:59 PM, JavaHokie soozandjohny...@gmail.com
wrote:

 Hi Corey,

 I am now trying to deploy this at work and I am unable to get this to run
 without putting accumulo-core, accumulo-fate, accumulo-trace,
 accumulo-tracer, and accumulo-tserver in the
 $HADOOP_COMMON_HOME/share/hadoop/common directory.  Can you tell me how you
 package your jar to obviate the need to put these jars here?

 Thanks

 --John

 On Sun, Aug 24, 2014 at 6:50 PM, Corey Nolet-2 [via Apache Accumulo]
 [hidden email] wrote:

 Awesome John! It's good to have this documented for future users. Keep us
 updated!


 On Sun, Aug 24, 2014 at 11:05 AM, JavaHokie [hidden email] wrote:

 Hi Corey,

 Just to wrap things up, AccumuloMultipeTableInputFormat is working really
 well.  This is an outstanding feature I can leverage big-time on my
 current
 work assignment, an IRAD I am working on, as well as my own prototype
 project.

 Thanks again for your help!

 --John



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11208.html
 Sent from the Users mailing list archive at Nabble.com.







 --
 View this message in context: Re: AccumuloMultiTableInputFormat
  IllegalStateException
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11303.html

 Sent from the Users mailing list archive
 http://apache-accumulo.1065345.n5.nabble.com/Users-f2.html at
 Nabble.com.



Re: Decouple topology configuration from code

2014-09-16 Thread Corey Nolet
A while ago I wrote a Camel adapter for Storm so that spout inputs
could come from Camel. Not sure how useful it would be for you, but it's
located here:

https://github.com/calrissian/storm-recipes/blob/master/camel/src/main/java/org/calrissian/recipes/camel/spout/CamelConsumerSpout.java

Hi folks,



Apache Camel has a number of DSLs which allow its topologies (routes wrt.
Camel terminology) to be set up and configured easily.

I am interested in such approach for Storm.

I found java beans usage in:

https://github.com/granthenke/storm-spring/

but sounds fairly limited to me.



Is there any other DSL like initiative for Storm ?



My second concern is storm cluster management: we’d like to have a registry
of topologies and be able to
register/destroy/launch/suspend/kill/update registered topologies
using a REST API.



Is there any tool/initiative to support that ?



Thx,



/DV



*Dominique Villard*

*Architecte logiciel / Lead Developer*
Orange/OF/DTSI/DSI/DFY/SDFY

*tél. 04 97 46 30 03*
dominique.vill...@orange.com dominique.vill...@orange-ftgroup.com





_

Ce message et ses pieces jointes peuvent contenir des informations
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez
recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les
messages electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere,
deforme ou falsifie. Merci.

This message and its attachments may contain confidential or
privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and
delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have
been modified, changed or falsified.
Thank you.


Re: Decouple topology configuration from code

2014-09-16 Thread Corey Nolet
Also, Trident is a DSL for rapidly producing useful analytics in Storm and
I've been working on a DSL that makes stream processing for complex event
processing possible.

That one is located here:

https://github.com/calrissian/flowmix
On Sep 16, 2014 4:29 AM, dominique.vill...@orange.com wrote:

  Hi folks,



 Apache Camel has a number of DSLs which allow its topologies (routes wrt.
 Camel terminology) to be set up and configured easily.

 I am interested in such approach for Storm.

 I found java beans usage in:

 https://github.com/granthenke/storm-spring/

 but sounds fairly limited to me.



 Is there any other DSL like initiative for Storm ?



 My second concern is storm cluster management: we’d like to have a
 registry of topologies and be able to
 register/destroy/launch/suspend/kill/update registered topologies
 using a REST API.



 Is there any tool/initiative to support that ?



 Thx,



 /DV



 *Dominique Villard*

 *Architecte logiciel / Lead Developer*
 Orange/OF/DTSI/DSI/DFY/SDFY

 *tél. 04 97 46 30 03*
 dominique.vill...@orange.com dominique.vill...@orange-ftgroup.com









Re: Time to release 1.6.1?

2014-09-11 Thread Corey Nolet
I'm on it. I'll get a more formal vote going after I dig through the jira a
bit and note what's changed.

On Thu, Sep 11, 2014 at 11:06 AM, Christopher ctubb...@apache.org wrote:

 Also, we can always have a 1.6.2 if there's outstanding bugfixes to release
 later.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Thu, Sep 11, 2014 at 10:36 AM, Eric Newton eric.new...@gmail.com
 wrote:

  +1 for 1.6.1.
 
  There are people testing a recent 1.6 branch at scale (100s of nodes),
 with
  the intent of pushing it to production.
 
  I would rather have a released version in production.
 
  Thanks for volunteering.  Feel free to contact me if you need a hand with
  anything.
 
  -Eric
 
 
  On Wed, Sep 10, 2014 at 1:49 PM, Josh Elser josh.el...@gmail.com
 wrote:
 
   Sure that's fine, Corey. Happy to help coordinate things with you.
   *Hopefully* it's not too painful :)
  
  
   On 9/10/14, 10:43 AM, Corey Nolet wrote:
  
   I had posted this to the mailing list originally after a discussion
 with
   Christopher at the Accumulo Summit hack-a-thon and because I wanted to
  get
   into the release process to help out.
  
   Josh, I still wouldn't mind getting together 1.6.1 if that's okay with
   you.
   If nothing else, it would get someone else following the procedures
 and
   able to do the release.
  
   On Wed, Sep 10, 2014 at 1:22 PM, Josh Elser josh.el...@gmail.com
  wrote:
  
That's exactly my plan, Christopher. Keith has been the man working
 on
  a
   fix for ACCUMULO-1628 which is what I've been spinning on to get
 1.5.2
   out
   the door. I want to spend a little time today looking at his patch to
   understand the fix and run some tests myself. Hopefully John can
 retest
   the
   patch as well since he had an environment that could reproduce the
 bug.
  
   Right after we get 1.5.2, I'm happy to work on 1.6.1 as well.
  
   - Josh
  
  
   On 9/10/14, 10:04 AM, Christopher wrote:
  
 Because of ACCUMULO-2988 (upgrade path from 1.4.x -> 1.6.y, y >=
 1),
   I'm
   hoping we can revisit this soon. Maybe get 1.5.2 out the door,
  followed
   by
   1.6.1 right away.
  
  
   --
   Christopher L Tubbs II
   http://gravatar.com/ctubbsii
  
   On Fri, Jun 20, 2014 at 10:30 AM, Keith Turner ke...@deenlo.com
   wrote:
  
 On Thu, Jun 19, 2014 at 11:46 AM, Josh Elser 
 josh.el...@gmail.com
  
   wrote:
  
 I was thinking the same thing, but I also haven't made any
 strides
  
  
towards
  
getting 1.5.2 closer to happening (as I said I'd try to do).
  
   I still lack physical resources to do the week-long testing as
 our
   guidelines currently force us to do. I still think this testing is
   excessive if we're actually releasing bug-fixes, but it does
  
differentiate
  
us from other communities.
  
  
I want to run some CI test because of the changes I made w/
 walog.
   I can
   run the test, but I would like to do that as late as possible.
  Just
   let
   me know when you are thinking of cutting a release.
  
   Also, I would like to get 2827 in for the release.
  
  
  
I'm really not sure how to approach this which is really why I've
  been
   stalling on it.
  
  
   On 6/19/14, 7:18 AM, Mike Drob wrote:
  
 I'd like to see 1.5.2 released first, just in case there are
  issues
   we
  
   discover during that process that need to be addressed. Also, I
  think
   it
   would be useful to resolve the discussion surrounding upgrades[1]
   before
   releasing.
  
   [1]:
   http://mail-archives.apache.org/mod_mbox/accumulo-dev/
   201406.mbox/%3CCAGHyZ6LFuwH%3DqGF9JYpitOY9yYDG-
   sop9g6iq57VFPQRnzmyNQ%40mail.gmail.com%3E
  
  
   On Thu, Jun 19, 2014 at 8:09 AM, Corey Nolet cjno...@gmail.com
   wrote:
  
  I'd like to start getting a candidate together if there are no
  
objections.
  
   It looks like we have 65 resolved tickets with a fix version of
   1.6.1.
  
  
  
  
  
  
  
  
 



Re: Tablet server thrift issue

2014-09-01 Thread Corey Nolet
As an update,

I raised the tablet server memory and I have not seen this error thrown
since. I'd like to say raising the memory, alone, was the solution but it
appears that I also may be having some performance issues with the switches
connecting the racks together. I'll update more as I dive in further.


On Fri, Aug 22, 2014 at 11:41 PM, Corey Nolet cjno...@gmail.com wrote:

 Josh,

 Your advice is definitely useful- I also thought about catching the
 exception and retrying with a fresh batch writer but the fact that the
 batch writer failure doesn't go away without being re-instantiated is
 really only a nuisance. The TabletServerBatchWriter could be designed much
 better, I agree, but that is not the root of the problem.

 The Thrift exception that is causing the issue is what I'd like to get to
 the bottom of. It's throwing the following:

 *TApplicationException: applyUpdates failed: out of sequence response *

 I've never seen this exception before in regular use of the client API-
 but I also just updated to 1.6.0. Google isn't showing anything useful for
 how exactly this exception could come about other than using a bad
 threading model- and I don't see any drastic changes or other user
 complaints on the mailing list that would validate that line of thought.
 Quite frankly, I'm stumped. This could be a Thrift exception related to a
 Thrift bug or something bad on my system and have nothing to do with
 Accumulo.

 Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
 seen the exception before and may remember what it was/how they fixed it.


 On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser josh.el...@gmail.com wrote:

 Don't mean to tell you that I don't think there might be a bug/otherwise,
 that's pretty much just the limit of what I know about the server-side
 sessions :)

  If you have concrete "this worked in 1.4.4" and "this happens instead
  with 1.6.0", that'd make a great ticket :D

 The BatchWriter failure case is pretty rough, actually. Eric has made
 some changes to help already (in 1.6.1, I think), but it needs an overhaul
 that I haven't been able to make time to fix properly, either. IIRC, the
 only guarantee you have is that all mutations added before the last flush()
 happened are durable on the server. Anything else is a guess. I don't know
 the specifics, but that should be enough to work with (and saving off
 mutations shouldn't be too costly since they're stored serialized).


 On 8/22/14, 5:44 PM, Corey Nolet wrote:

 Thanks Josh,

 I understand about the session ID completely but the problem I have is
 that
 the exact same client code worked, line for line, just fine in 1.4.4 and
 it's acting up in 1.6.0. I also seem to remember the BatchWriter
 automatically creating a new session when one expired without an
 exception
 causing it to fail on the client.

 I know we've made changes since 1.4.4 but I'd like to troubleshoot the
 actual issue of the BatchWriter failing due to the thrift exception
 rather
 than just catching the exception and trying mutations again. The other
 issue is that I've already submitted a bunch of mutations to the batch
 writer from different threads. Does that mean I need to be storing them
 off
 twice? (once in the BatchWriter's cache and once in my own)

 The BatchWriter in my ingester is constantly sending data and the tablet
 servers have been given more than enough memory to be able to keep up.
 There's no swap being used and the network isn't experiencing any errors.


 On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser josh.el...@gmail.com
 wrote:

  If you get an error from a BatchWriter, you pretty much have to throw
 away
 that instance of the BatchWriter and make a new one. See ACCUMULO-2990.
 If
 you want, you should be able to catch/recover from this without having
 to
 restart the ingester.

 If the session ID is invalid, my guess is that it hasn't been used
 recently and the tserver cleaned it up. The exception logic isn't the
 greatest (as it just is presented to you as a RTE).

 https://issues.apache.org/jira/browse/ACCUMULO-2990


 On 8/22/14, 4:35 PM, Corey Nolet wrote:

  Eric & Keith, Chris mentioned to me that you guys have seen this issue
 before. Any ideas from anyone else are much appreciated as well.

 I recently updated a project's dependencies to Accumulo 1.6.0 built
 with
 Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
 component which is running all the time with a batch writer using many
 threads to push mutations into Accumulo.

 The issue I'm having is a show stopper. At different intervals of time,
 sometimes an hour, sometimes 30 minutes, I'm getting
 MutationsRejectedExceptions (server errors) from the
 TabletServerBatchWriter. Once they start, I need to restart the
 ingester
 to
 get them to stop. They always come back within 30 minutes to an hour...
 rinse, repeat.

 The exception always happens on different tablet servers. It's a thrift
 error saying a message was received out

Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-24 Thread Corey Nolet
I'm thinking this could be a yarn.application.classpath configuration
problem in your yarn-site.xml. I meant to ask earlier- how are you building
your jar that gets deployed? Are you shading it? Using libjars?



On Sun, Aug 24, 2014 at 6:56 AM, JavaHokie soozandjohny...@gmail.com
wrote:

 Hey Corey,

 Yah, sometimes ya just gotta go to the source code. :)

 It's a weird exception message...I am used to seeing NoClassDefFoundError
 and ClassNotFoundException.  It's also weird that the ReflectionException
  is not thrown, with NoClassDefFoundError or ClassNotFoundException as the root
 exception.

 Anyways, it's a classpath deal, but it's a weird one.  I thought maybe I
 had
 a 1.5 jar around somewhere, but the fact that the InputConfigurator--also
 new in Accumulo 1.6--can be found but InputTableConfig cannot is a bit
 puzzling.  But...us Java developers are used to figuring out classpath
 problems. Currently researching it on my end.

 Thanks again for all of your help so far--again, much appreciated.  Really
 excited to use this new feature.

 --John



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11204.html
 Sent from the Users mailing list archive at Nabble.com.



Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-24 Thread Corey Nolet
Awesome John! It's good to have this documented for future users. Keep us
updated!


On Sun, Aug 24, 2014 at 11:05 AM, JavaHokie soozandjohny...@gmail.com
wrote:

 Hi Corey,

 Just to wrap things up, AccumuloMultiTableInputFormat is working really
 well.  This is an outstanding feature I can leverage big-time on my current
 work assignment, an IRAD I am working on, as well as my own prototype
 project.

 Thanks again for your help!

 --John



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11208.html
 Sent from the Users mailing list archive at Nabble.com.



Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-23 Thread Corey Nolet
Awesome! I was going to recommend checking out the code last night so that
you could put some logging statements in there. You've probably noticed
this already but the MapWritable does not have static type parameters so it
dumps out the fully qualified class name so that it can instantiate it back
using readFields() when it's deserializing.

That error is happening when the reflection is occurring- though it doesn't
make much sense. The Accumulo mapreduce packages are obviously on the
classpath. If you are still having this issue, I'll keep looking more into
this as well.
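
To make that concrete, here's a tiny standalone round trip (just a sketch; it
uses Text values instead of InputTableConfig so it compiles on its own, but the
failure mode is the same if the recorded value class can't be loaded):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;

    public class MapWritableRoundTrip {
      public static void main(String[] args) throws IOException {
        MapWritable in = new MapWritable();
        in.put(new Text("twitteredges"), new Text("placeholder"));

        // write() records the value's fully qualified class name so that
        // readFields() can instantiate it by name on the other side.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        in.write(new DataOutputStream(buffer));

        // This is the step that blows up with "can't find class ..." when the
        // recorded class isn't on the classpath of the deserializing JVM.
        MapWritable out = new MapWritable();
        out.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(out.keySet());
      }
    }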


On Aug 23, 2014 2:37 PM, JavaHokie soozandjohny...@gmail.com wrote:

 I checked out 1.6.0 from git and updated the exception handling for the
 getInputTableConfigs method, rebuilt, and tested my M/R jobs that use
 Accumulo as a source or sink just to ensure everything is still working
 correctly.

 I then updated the InputConfigurator.getInputTableConfig exception handling
 and I see the root cause is as follows:

 java.io.IOException: can't find class:
 org.apache.accumulo.core.client.mapreduce.InputTableConfig because
 org.apache.accumulo.core.client.mapreduce.InputTableConfig
 at

 org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212)
 at
 org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:169)
 at

 org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.getInputTableConfigs(InputConfigurator.java:563)
 at

 org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:644)
 at

 org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:342)
 at

 org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:537)
 at

 org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)
 at

 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)

  The IOException "can't find class classname because classname" is a new
  one for me, but at least I have something specific to research.

 --John




 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11202.html
 Sent from the Users mailing list archive at Nabble.com.



Re: Tablet server thrift issue

2014-08-22 Thread Corey Nolet
Thanks Josh,

I understand about the session ID completely but the problem I have is that
the exact same client code worked, line for line, just fine in 1.4.4 and
it's acting up in 1.6.0. I also seem to remember the BatchWriter
automatically creating a new session when one expired without an exception
causing it to fail on the client.

I know we've made changes since 1.4.4 but I'd like to troubleshoot the
actual issue of the BatchWriter failing due to the thrift exception rather
than just catching the exception and trying mutations again. The other
issue is that I've already submitted a bunch of mutations to the batch
writer from different threads. Does that mean I need to be storing them off
twice? (once in the BatchWriter's cache and once in my own)

The BatchWriter in my ingester is constantly sending data and the tablet
servers have been given more than enough memory to be able to keep up.
There's no swap being used and the network isn't experiencing any errors.


On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser josh.el...@gmail.com wrote:

 If you get an error from a BatchWriter, you pretty much have to throw away
 that instance of the BatchWriter and make a new one. See ACCUMULO-2990. If
 you want, you should be able to catch/recover from this without having to
 restart the ingester.

 If the session ID is invalid, my guess is that it hasn't been used
 recently and the tserver cleaned it up. The exception logic isn't the
 greatest (as it just is presented to you as a RTE).

 https://issues.apache.org/jira/browse/ACCUMULO-2990


 On 8/22/14, 4:35 PM, Corey Nolet wrote:

  Eric & Keith, Chris mentioned to me that you guys have seen this issue
 before. Any ideas from anyone else are much appreciated as well.

 I recently updated a project's dependencies to Accumulo 1.6.0 built with
 Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
 component which is running all the time with a batch writer using many
 threads to push mutations into Accumulo.

 The issue I'm having is a show stopper. At different intervals of time,
 sometimes an hour, sometimes 30 minutes, I'm getting
 MutationsRejectedExceptions (server errors) from the
 TabletServerBatchWriter. Once they start, I need to restart the ingester
 to
 get them to stop. They always come back within 30 minutes to an hour...
 rinse, repeat.

 The exception always happens on different tablet servers. It's a thrift
 error saying a message was received out of sequence. In the TabletServer
 logs, I see an Invalid session id exception which happens only once
 before the client-side batch writer starts spitting out the MREs.

 I'm running some heavyweight processing in Storm along side the tablet
 servers. I shut that processing off in hopes that maybe it was the culprit
 but that hasn't fixed the issue.

 I'm surprised I haven't seen any other posts on the topic.

 Thanks!




Re: Tablet server thrift issue

2014-08-22 Thread Corey Nolet
Josh,

Your advice is definitely useful- I also thought about catching the
exception and retrying with a fresh batch writer but the fact that the
batch writer failure doesn't go away without being re-instantiated is
really only a nuisance. The TabletServerBatchWriter could be designed much
better, I agree, but that is not the root of the problem.

The Thrift exception that is causing the issue is what I'd like to get to
the bottom of. It's throwing the following:

*TApplicationException: applyUpdates failed: out of sequence response *

I've never seen this exception before in regular use of the client API- but
I also just updated to 1.6.0. Google isn't showing anything useful for how
exactly this exception could come about other than using a bad threading
model- and I don't see any drastic changes or other user complaints on the
mailing list that would validate that line of thought. Quite frankly, I'm
stumped. This could be a Thrift exception related to a Thrift bug or
something bad on my system and have nothing to do with Accumulo.

Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
seen the exception before and may remember what it was/how they fixed it.
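
For reference, the catch-and-recreate approach we've been talking about looks
roughly like this (just a sketch against the 1.6.0 client API; the class and
method names are made up, and it only works around the dead writer rather than
explaining the underlying Thrift error):

    import java.util.List;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Mutation;

    public class RecoveringWriter {
      private final Connector connector;  // assumed to be built elsewhere
      private final String table;
      private BatchWriter writer;

      public RecoveringWriter(Connector connector, String table) throws TableNotFoundException {
        this.connector = connector;
        this.table = table;
        this.writer = connector.createBatchWriter(table, new BatchWriterConfig());
      }

      // The caller keeps its own list of un-flushed mutations; only mutations
      // added before the last successful flush() are known to be durable.
      public synchronized void writeAll(List<Mutation> pending)
          throws MutationsRejectedException, TableNotFoundException {
        try {
          for (Mutation m : pending) {
            writer.addMutation(m);
          }
          writer.flush();
          pending.clear();
        } catch (MutationsRejectedException e) {
          // A failed BatchWriter stays failed (ACCUMULO-2990): throw it away,
          // build a fresh one, and let the caller retry the pending batch.
          try {
            writer.close();
          } catch (MutationsRejectedException ignored) {
          }
          writer = connector.createBatchWriter(table, new BatchWriterConfig());
          throw e;
        }
      }
    }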


On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser josh.el...@gmail.com wrote:

 Don't mean to tell you that I don't think there might be a bug/otherwise,
 that's pretty much just the limit of what I know about the server-side
 sessions :)

 If you have concrete "this worked in 1.4.4" and "this happens instead with
 1.6.0", that'd make a great ticket :D

 The BatchWriter failure case is pretty rough, actually. Eric has made some
 changes to help already (in 1.6.1, I think), but it needs an overhaul that
 I haven't been able to make time to fix properly, either. IIRC, the only
 guarantee you have is that all mutations added before the last flush()
 happened are durable on the server. Anything else is a guess. I don't know
 the specifics, but that should be enough to work with (and saving off
 mutations shouldn't be too costly since they're stored serialized).


 On 8/22/14, 5:44 PM, Corey Nolet wrote:

 Thanks Josh,

 I understand about the session ID completely but the problem I have is
 that
 the exact same client code worked, line for line, just fine in 1.4.4 and
 it's acting up in 1.6.0. I also seem to remember the BatchWriter
 automatically creating a new session when one expired without an exception
 causing it to fail on the client.

 I know we've made changes since 1.4.4 but I'd like to troubleshoot the
 actual issue of the BatchWriter failing due to the thrift exception rather
 than just catching the exception and trying mutations again. The other
 issue is that I've already submitted a bunch of mutations to the batch
 writer from different threads. Does that mean I need to be storing them
 off
 twice? (once in the BatchWriter's cache and once in my own)

 The BatchWriter in my ingester is constantly sending data and the tablet
 servers have been given more than enough memory to be able to keep up.
 There's no swap being used and the network isn't experiencing any errors.


 On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser josh.el...@gmail.com wrote:

  If you get an error from a BatchWriter, you pretty much have to throw
 away
 that instance of the BatchWriter and make a new one. See ACCUMULO-2990.
 If
 you want, you should be able to catch/recover from this without having to
 restart the ingester.

 If the session ID is invalid, my guess is that it hasn't been used
 recently and the tserver cleaned it up. The exception logic isn't the
 greatest (as it just is presented to you as a RTE).

 https://issues.apache.org/jira/browse/ACCUMULO-2990


 On 8/22/14, 4:35 PM, Corey Nolet wrote:

  Eric & Keith, Chris mentioned to me that you guys have seen this issue
 before. Any ideas from anyone else are much appreciated as well.

 I recently updated a project's dependencies to Accumulo 1.6.0 built with
 Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an ingest
 component which is running all the time with a batch writer using many
 threads to push mutations into Accumulo.

 The issue I'm having is a show stopper. At different intervals of time,
 sometimes an hour, sometimes 30 minutes, I'm getting
 MutationsRejectedExceptions (server errors) from the
 TabletServerBatchWriter. Once they start, I need to restart the ingester
 to
 get them to stop. They always come back within 30 minutes to an hour...
 rinse, repeat.

 The exception always happens on different tablet servers. It's a thrift
 error saying a message was received out of sequence. In the TabletServer
 logs, I see an Invalid session id exception which happens only once
 before the client-side batch writer starts spitting out the MREs.

 I'm running some heavyweight processing in Storm along side the tablet
 servers. I shut that processing off in hopes that maybe it was the
 culprit
 but that hasn't fixed the issue.

 I'm surprised I haven't seen any other posts on the topic

Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-22 Thread Corey Nolet
Hey John,

Could you give an example of one of the ranges you are using which causes
this to happen?


On Fri, Aug 22, 2014 at 11:02 PM, John Yost soozandjohny...@gmail.com
wrote:

 Hey Everyone,

 The AccumuloMultiTableInputFormat is an awesome addition to the Accumulo
 API and I am really excited to start using it.

 My first attempt with the 1.6.0 release resulted in this
 IllegalStateException:

 java.lang.IllegalStateException: The table query configurations could not
 be deserialized from the given configuration

 at
 org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.getInputTableConfigs(InputConfigurator.java:566)

 at
 org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:628)

 at
 org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:342)

 at
 org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:537)

 at
 org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)

 at
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)

 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)

 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:415)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)

 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)

 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)

 at
 com.johnyostanalytics.mapreduce.client.TwitterJoin.run(TwitterJoin.java:104)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)

 when I attempt to initialize the AccumuloMultiTableInputFormat:

 InputTableConfig baseConfig = new InputTableConfig();
 baseConfig.setRanges(ranges);

 InputTableConfig edgeConfig = new InputTableConfig();
 edgeConfig.setRanges(ranges);
 configs.put("base", baseConfig);
 configs.put("edges", edgeConfig);

 AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);

 Any ideas as to what may be going on?  I know that the table names are
 valid and that the Range objects are valid because I tested all of that
 independently via Accumulo scans.

 Any guidance is greatly appreciated because, again,
 AcumuloMultiTableInputFormat is really cool and I am really looking forward
 to using it.

 Thanks

 --John



Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-22 Thread Corey Nolet
The table configs get serialized as base64 and placed in the job's
Configuration under the key AccumuloInputFormat.ScanOpts.TableConfigs.
Could you verify/print what's being placed in this key in your
configuration?
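
Something like this should show whether anything is actually being stored (a
standalone sketch: it assumes the key name above, reuses one of your
tables/ranges, and doesn't need a live cluster):

    import java.util.Collections;
    import org.apache.accumulo.core.client.mapreduce.AccumuloMultiTableInputFormat;
    import org.apache.accumulo.core.client.mapreduce.InputTableConfig;
    import org.apache.accumulo.core.data.Range;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TableConfigCheck {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());

        InputTableConfig config = new InputTableConfig();
        config.setRanges(Collections.singletonList(new Range("104587")));
        AccumuloMultiTableInputFormat.setInputTableConfigs(job,
            Collections.singletonMap("twitteredges", config));

        // The configs are stored as a base64-encoded MapWritable under this key;
        // null here would mean they never made it into the configuration.
        System.out.println(job.getConfiguration()
            .get("AccumuloInputFormat.ScanOpts.TableConfigs"));
      }
    }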



On Sat, Aug 23, 2014 at 12:15 AM, JavaHokie soozandjohny...@gmail.com
wrote:

 Hey Corey,

 Sure thing!  Here is my code:

 Map<String,InputTableConfig> configs = new HashMap<String,InputTableConfig>();

 List<Range> ranges = Lists.newArrayList(new Range("104587"), new Range("105255"));

 InputTableConfig edgeConfig = new InputTableConfig();
 edgeConfig.setRanges(ranges);

 InputTableConfig followerConfig = new InputTableConfig();
 followerConfig.setRanges(ranges);

 configs.put("following", followerConfig);
 configs.put("twitteredges", edgeConfig);

 These are the row values I am using to join entries from the following and
 twitteredges tables.

 --John



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11189.html
 Sent from the Users mailing list archive at Nabble.com.



Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-22 Thread Corey Nolet
The tests I'm running aren't using the native Hadoop libs either. If you
don't mind,  a little more code as to how you are setting up your job would
be useful. That's weird the key in the config would be null. Are you using
the job.getConfiguration()?


On Sat, Aug 23, 2014 at 12:31 AM, JavaHokie soozandjohny...@gmail.com
wrote:

 Hey Corey,

 Gotcha, i get a null when I attempt to log the value:

 log.debug(configuration.get("AccumuloInputFormat.ScanOpts.TableConfigs"));

 --John



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11191.html
 Sent from the Users mailing list archive at Nabble.com.



Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-22 Thread Corey Nolet
Also, if you don't mind me asking, why isn't your job setup class extending
Configured? That way you are picking up configurations injected from the
environment.

You would do "MyJobSetUpClass extends Configured" and then use getConf()
instead of newing up a new Configuration.
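
Roughly like this (a sketch of the standard Tool/Configured pattern; it reuses
your class and job names and elides the rest of the setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class TwitterJoin extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner injected
        // (including -D and -libjars options) instead of a brand new one.
        Job job = Job.getInstance(getConf());
        job.setJobName("TwitterJoin Query");
        job.setJarByClass(TwitterJoin.class);
        // ... the same mapper/reducer and input/output format setup as before ...
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TwitterJoin(), args));
      }
    }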


On Sat, Aug 23, 2014 at 1:11 AM, Corey Nolet cjno...@gmail.com wrote:

 Job.getInstance(configuration) copies the configuration and makes its own.
 Try doing your debug statement from earlier on job.getConfiguration() and
 let's see what the base64 string looks like.
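
 A quick way to see the copy happening (a tiny sketch; the key is made up):

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.mapreduce.Job;

     public class ConfigCopyDemo {
       public static void main(String[] args) throws Exception {
         Configuration configuration = new Configuration();
         Job job = Job.getInstance(configuration);

         configuration.set("example.key", "set-after-getInstance");  // made-up key

         // Prints null: Job.getInstance() copied the Configuration, so later
         // changes to the original aren't visible via job.getConfiguration().
         System.out.println(job.getConfiguration().get("example.key"));
       }
     }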



 On Sat, Aug 23, 2014 at 1:00 AM, JavaHokie soozandjohny...@gmail.com
 wrote:

 Sure thing, here's my run method implementation:

 Configuration configuration = new Configuration();

  configuration.set("fs.defaultFS", "hdfs://127.0.0.1:8020");
  configuration.set("mapreduce.job.tracker", "localhost:54311");
  configuration.set("mapreduce.framework.name", "yarn");
  configuration.set("yarn.resourcemanager.address", "localhost:8032");

 Job job = Job.getInstance(configuration);

 /*
  * Set the basic stuff
  */
  job.setJobName("TwitterJoin Query");
 job.setJarByClass(TwitterJoin.class);

 /*
  * Set Mapper and Reducer Classes
  */
 job.setMapperClass(TwitterJoinMapper.class);
 job.setReducerClass(TwitterJoinReducer.class);

 /*
  * Set the Mapper MapOutputKeyClass and MapOutputValueClass
  */
 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(Text.class);

 /*
  * Set the Reducer OutputKeyClass and OutputValueClass
  */
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Mutation.class);

 /*
  * Set InputFormat and OutputFormat classes
  */

 job.setInputFormatClass(AccumuloMultiTableInputFormat.class);
 job.setOutputFormatClass(AccumuloOutputFormat.class);

 /*
  * Configure InputFormat and OutputFormat Classes
  */
  Map<String,InputTableConfig> configs = new HashMap<String,InputTableConfig>();

  List<Range> ranges = Lists.newArrayList(new Range("104587"), new Range("105255"));

  InputTableConfig edgeConfig = new InputTableConfig();
  edgeConfig.setRanges(ranges);
  edgeConfig.setAutoAdjustRanges(true);

  InputTableConfig followerConfig = new InputTableConfig();
  followerConfig.setRanges(ranges);
  followerConfig.setAutoAdjustRanges(true);

  configs.put("following", followerConfig);
  configs.put("twitteredges", edgeConfig);

  AccumuloMultiTableInputFormat.setConnectorInfo(job, "root", new PasswordToken("".getBytes()));
  AccumuloMultiTableInputFormat.setZooKeeperInstance(job, "localhost", "localhost");
  AccumuloMultiTableInputFormat.setScanAuthorizations(job, new Authorizations("private"));
  AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);

  AccumuloOutputFormat.setZooKeeperInstance(job, "localhost", "localhost");
  AccumuloOutputFormat.setConnectorInfo(job, "root", new PasswordToken("".getBytes()));
  AccumuloOutputFormat.setCreateTables(job, true);
  AccumuloOutputFormat.setDefaultTableName(job, "twitteredgerollup");

 /*
  * Kick off the job, wait for completion, and return
 applicable code
  */
 boolean success = job.waitForCompletion(true);

 if (success) {
 return 0;
 }

 return 1;
 }



 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11193.html
 Sent from the Users mailing list archive at Nabble.com.





Re: AccumuloMultiTableInputFormat IllegalStateException

2014-08-22 Thread Corey Nolet
That code I posted should be able to validate where you are getting hung
up. Can you try running that on the machine and seeing if it prints the
expected tables/ranges?


Also, are you running the job live? What does the configuration look like
for the job on your resource manager? Can you see if the base64 matches?
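
For anyone following along, here's roughly what that standalone check looks
like (a sketch that mirrors what InputConfigurator does rather than its exact
code; it assumes you paste in the base64 string pulled from the job
configuration, and getRanges() is only there to print something recognizable):

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.accumulo.core.client.mapreduce.InputTableConfig;
    import org.apache.commons.codec.binary.Base64;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Writable;

    public class DecodeTableConfigs {
      public static void main(String[] args) throws Exception {
        String configString = args[0];  // the base64 value from the job configuration

        byte[] bytes = Base64.decodeBase64(configString.getBytes(StandardCharsets.UTF_8));

        // readFields() is where "can't find class ..." surfaces if InputTableConfig
        // isn't on the classpath of the JVM doing the deserializing.
        MapWritable mapWritable = new MapWritable();
        mapWritable.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));

        for (Map.Entry<Writable, Writable> entry : mapWritable.entrySet()) {
          InputTableConfig tableConfig = (InputTableConfig) entry.getValue();
          System.out.println(entry.getKey() + " -> " + tableConfig.getRanges());
        }
      }
    }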


On Sat, Aug 23, 2014 at 1:47 AM, JavaHokie soozandjohny...@gmail.com
wrote:

 H...the byte[] array is generated OK.

 byte[] bytes =
 Base64.decodeBase64(configString.getBytes(StandardCharsets.UTF_8));

 I wonder what's going wrong with one of these lines below?

 ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
 mapWritable.readFields(new DataInputStream(bais));




 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/AccumuloMultiTableInputFormat-IllegalStateException-tp11186p11199.html
 Sent from the Users mailing list archive at Nabble.com.



Re: Kafka + Storm

2014-08-14 Thread Corey Nolet
Kafka is also distributed in nature, which is not something easily achieved
by queuing brokers like ActiveMQ or JMS (1.0) in general. Kafka allows data
to be partitioned across many machines which can grow as necessary as your
data grows.




On Thu, Aug 14, 2014 at 11:20 PM, Justin Workman justinjwork...@gmail.com
wrote:

 Absolutely!

 Sent from my iPhone

 On Aug 14, 2014, at 9:02 PM, anand nalya anand.na...@gmail.com wrote:

 I agree, not for the long run but for small bursts in data production
 rate, say peak hours, Kafka can help in providing a somewhat consistent
 load on the Storm cluster.
 --
 From: Justin Workman justinjwork...@gmail.com
 Sent: ‎15-‎08-‎2014 07:53
 To: user@storm.incubator.apache.org
 Subject: Re: Kafka + Storm

 I suppose not directly.  It depends on the lifetime of your Kafka queues
 and on your latency requirements. You need to make sure you have enough
 doctors, or in Storm language workers, in your storm cluster to process
 your messages within your SLA.

 For our case we, we have a 3 hour lifetime or ttl configured for our
 queues. Meaning records in the queue older than 3 hours are purged. We also
 have an internal SLA ( team goal, not published to the business ;)) of 10
 seconds from event to end of stream and available for end user consumption.

 So we need to make sure we have enough storm workers to to meet; 1) the
 normal SLA and 2) be able to catch up on the queues when we have to take
 storm down for maintenance and such and the queues build.

 There are many knobs you can tune for both storm and Kafka. We have spent
 many hours tuning things to meet our SLAs.

 Justin

 Sent from my iPhone

 On Aug 14, 2014, at 8:05 PM, anand nalya anand.na...@gmail.com wrote:

 Also, since Kafka acts as a buffer, storm is not directly affected by the
 speed of your data sources/producers.
 --
 From: Justin Workman justinjwork...@gmail.com
 Sent: ‎15-‎08-‎2014 07:12
 To: user@storm.incubator.apache.org
 Subject: Re: Kafka + Storm

 Good analogy!

 Sent from my iPhone

 On Aug 14, 2014, at 7:36 PM, Adaryl "Bob" Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

  Ah so Storm is the hospital and Kafka is the waiting room where
 everybody queues up to be seen in turn yes?

 Adaryl Bob Wakefield, MBA
 Principal
 Mass Street Analytics
 913.938.6685
 www.linkedin.com/in/bobwakefieldmba
 Twitter: @BobLovesData

  *From:* Justin Workman justinjwork...@gmail.com
 *Sent:* Thursday, August 14, 2014 7:47 PM
 *To:* user@storm.incubator.apache.org
 *Subject:* Re: Kafka + Storm

  If you are familiar with Weblogic or ActiveMQ, it is similar. Let's see
 if I can explain, I am definitely not a subject matter expert on this.

 Within Kafka you can create queues, ie a webclicks queue. Your web
 servers can then send click events to this queue in Kafka. The web servers,
 or agent writing the events to this queue are referred to as the
 producer.  Each event, or message in Kafka is assigned an id.

 On the other side there are consumers, in storms case this would be the
 storm Kafka spout, that can subscribe to this webclicks queue to consume
 the messages that are in the queue. The consumer can consume a single
 message from the queue, or a batch of messages, as storm does. The consumer
 keeps track of the latest offset, Kafka message id, that it has consumed.
 This way the next time the consumer checks to see if there are more
 messages to consume it will ask for messages with a message id greater than
 its last offset.
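
 In code, that bookkeeping boils down to something like this (a toy sketch of
 the offset idea only, not the actual Kafka client API):

     import java.util.Arrays;
     import java.util.List;

     // Toy illustration of offset tracking only -- not the Kafka client API.
     public class OffsetDemo {
       public static void main(String[] args) {
         List<String> queue = Arrays.asList("click-1", "click-2", "click-3", "click-4");
         long lastOffset = -1;  // what the consumer remembers between polls

         // First poll: consume the first two messages in the queue.
         for (long offset = lastOffset + 1; offset < 2; offset++) {
           System.out.println("consumed " + queue.get((int) offset) + " at offset " + offset);
           lastOffset = offset;
         }

         // The next poll only asks for messages with an offset greater than
         // lastOffset, so nothing is re-consumed.
         for (long offset = lastOffset + 1; offset < queue.size(); offset++) {
           System.out.println("consumed " + queue.get((int) offset) + " at offset " + offset);
           lastOffset = offset;
         }
       }
     }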

 This helps with the reliability of the event stream and helps guarantee
 that your events/message make it start to finish through your stream,
 assuming the events get to Kafka ;)

 Hope this helps and makes some sort of sense. Again, sent from my iPhone ;)

 Justin

 Sent from my iPhone

 On Aug 14, 2014, at 6:28 PM, Adaryl "Bob" Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   I get your reasoning at a high level. I should have specified that I
 wasn’t sure what Kafka does. I don’t have a hard software engineering
 background. I know that Kafka is “a message queuing” system, but I don’t
 really know what that means.

 (I can’t believe you wrote all that from your iPhone)
 B.


  *From:* Justin Workman justinjwork...@gmail.com
 *Sent:* Thursday, August 14, 2014 7:22 PM
 *To:* user@storm.incubator.apache.org
 *Subject:* Re: Kafka + Storm

  Personally, we looked at several options, including writing our own
 storm source. There are limited storm sources with community support out
 there. For us, it boiled down to the following;

 1) community support and what appeared to be a standard method. Storm has
 now included the kafka source as a bundled component to storm. This made
 the implementation much faster, because the code was done.
 2) the durability (replication and clustering) of Kafka. We have a three
 hour retention period on our queues, so if we need to do maintenance on
 storm or deploy an updated topology, 

Re: Good way to test when topology in local cluster is fully active

2014-08-05 Thread Corey Nolet
Vincent & P. Taylor,

I played with the testing framework for a little bit last night and don't
see any easy way to provide pauses in between the emissions of the mock
tuples. For instance, my sliding window semantics are heavily orchestrated
by time evictions and triggers which mean that I need to be able to time
the tuples being fed into the tests (i.e. emit a tuple every 500ms and run
the test for 25 seconds).
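
Concretely, the kind of timed mock spout I mean looks something like this (a
sketch against the 0.9.2 backtype.storm API; the field names are made up):

    import java.util.Map;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class TimedMockSpout extends BaseRichSpout {
      private SpoutOutputCollector collector;
      private int count = 0;

      @Override
      public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
      }

      @Override
      public void nextTuple() {
        // Emit one mock event every 500ms so the window evictions and triggers
        // fire on a predictable schedule during the test.
        Utils.sleep(500);
        collector.emit(new Values("event-" + count++, System.currentTimeMillis()));
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event", "timestamp"));
      }
    }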


On Tue, Aug 5, 2014 at 2:23 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 My guess is that the slowdown you are seeing is a result of the new
 version of ZooKeeper and how it handles IPv4/6.

 Try adding the following JVM parameter when running your tests:

 -Djava.net.preferIPv4Stack=true

 -Taylor

 On Aug 4, 2014, at 8:49 PM, Corey Nolet cjno...@gmail.com wrote:

  I'm testing some sliding window algorithms with tuples emitted from a
 mock spout based on a timer but the amount of time it takes the topology to
 fully start up and activate seems to vary from computer to computer.
 Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my
 tests are breaking because the time to activate the topology is taking
 longer (because of Netty possibly?). I'd like to make my tests more
 resilient to things like this.
 
  Is there something I can look at in LocalCluster where I could do
  while(!active) { Thread.sleep(50); } ?
 
  This is what my test looks like currently:
 
StormTopology topology = buildTopology(...);
Config conf = new Config();
conf.setNumWorkers(1);
 
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(getTopologyName(), conf, topology);
 
try {
  Thread.sleep(4000);
} catch (InterruptedException e) {
  e.printStackTrace();
}
 
cluster.shutdown();
 
assertEquals(4, MockSinkBolt.getEvents().size());
 
 
 
  Thanks!
 
 




Re: Good way to test when topology in local cluster is fully active

2014-08-05 Thread Corey Nolet
Sorry- the ipv4 fix worked.


On Tue, Aug 5, 2014 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:

 This did work. Thanks!


 On Tue, Aug 5, 2014 at 2:23 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 My guess is that the slowdown you are seeing is a result of the new
 version of ZooKeeper and how it handles IPv4/6.

 Try adding the following JVM parameter when running your tests:

 -Djava.net.preferIPv4Stack=true

 -Taylor

 On Aug 4, 2014, at 8:49 PM, Corey Nolet cjno...@gmail.com wrote:

  I'm testing some sliding window algorithms with tuples emitted from a
 mock spout based on a timer but the amount of time it takes the topology to
 fully start up and activate seems to vary from computer to computer.
 Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my
 tests are breaking because the time to activate the topology is taking
 longer (because of Netty possibly?). I'd like to make my tests more
 resilient to things like this.
 
  Is there something I can look at in LocalCluster where I could do
  while(!active) { Thread.sleep(50); } ?
 
  This is what my test looks like currently:
 
StormTopology topology = buildTopology(...);
Config conf = new Config();
conf.setNumWorkers(1);
 
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(getTopologyName(), conf, topology);
 
try {
  Thread.sleep(4000);
} catch (InterruptedException e) {
  e.printStackTrace();
}
 
cluster.shutdown();
 
assertEquals(4, MockSinkBolt.getEvents().size());
 
 
 
  Thanks!
 
 





Re: Good way to test when topology in local cluster is fully active

2014-08-05 Thread Corey Nolet
This did work. Thanks!


On Tue, Aug 5, 2014 at 2:23 PM, P. Taylor Goetz ptgo...@gmail.com wrote:

 My guess is that the slowdown you are seeing is a result of the new
 version of ZooKeeper and how it handles IPv4/6.

 Try adding the following JVM parameter when running your tests:

 -Djava.net.preferIPv4Stack=true

 -Taylor

 On Aug 4, 2014, at 8:49 PM, Corey Nolet cjno...@gmail.com wrote:

  I'm testing some sliding window algorithms with tuples emitted from a
 mock spout based on a timer but the amount of time it takes the topology to
 fully start up and activate seems to vary from computer to computer.
 Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my
 tests are breaking because the time to activate the topology is taking
 longer (because of Netty possibly?). I'd like to make my tests more
 resilient to things like this.
 
  Is there something I can look at in LocalCluster where I could do
  while(!active) { Thread.sleep(50); } ?
 
  This is what my test looks like currently:
 
StormTopology topology = buildTopology(...);
Config conf = new Config();
conf.setNumWorkers(1);
 
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(getTopologyName(), conf, topology);
 
try {
  Thread.sleep(4000);
} catch (InterruptedException e) {
  e.printStackTrace();
}
 
cluster.shutdown();
 
assertEquals(4, MockSinkBolt.getEvents().size());
 
 
 
  Thanks!
 
 




Re: Good way to test when topology in local cluster is fully active

2014-08-04 Thread Corey Nolet
Oh Nice. Is this new in 0.9.*? I just updated so I haven't looked much into
what's changed yet, other than Netty.


On Mon, Aug 4, 2014 at 10:40 PM, Vincent Russell vincent.russ...@gmail.com
wrote:

 Corey,

 Have you tried using the integration testing framework that comes with
 storm?


 Testing.withSimulatedTimeLocalCluster(mkClusterParam, new TestJob() {
     @Override
     public void run(ILocalCluster cluster) throws Exception {
         CompleteTopologyParam completeTopologyParam = new CompleteTopologyParam();
         completeTopologyParam.setMockedSources(mockedSources);
         completeTopologyParam.setStormConf(daemonConf);
         completeTopologyParam.setTopologyName(getTopologyName());

         // completeTopology() runs the topology to completion against the
         // mocked sources and returns the tuples emitted by each component.
         Map result = Testing.completeTopology(cluster, topology,
             completeTopologyParam);
     }
 });

 -Vincent

 On Mon, Aug 4, 2014 at 8:49 PM, Corey Nolet cjno...@gmail.com wrote:

 I'm testing some sliding window algorithms with tuples emitted from a
 mock spout based on a timer but the amount of time it takes the topology to
 fully start up and activate seems to vary from computer to computer.
 Specifically, I just updated from 0.8.2 to 0.9.2-incubating and all of my
 tests are breaking because the time to activate the topology is taking
 longer (because of Netty possibly?). I'd like to make my tests more
 resilient to things like this.

 Is there something I can look at in LocalCluster where I could do
  while(!active) { Thread.sleep(50); } ?

 This is what my test looks like currently:

   StormTopology topology = buildTopology(...);
   Config conf = new Config();
   conf.setNumWorkers(1);

   LocalCluster cluster = new LocalCluster();
   cluster.submitTopology(getTopologyName(), conf, topology);

   try {
 Thread.sleep(4000);
   } catch (InterruptedException e) {
 e.printStackTrace();
   }

   cluster.shutdown();

   assertEquals(4, MockSinkBolt.getEvents().size());



 Thanks!





