Yes to splitting builds into legacy and scala (mostly). D can speak to his 
stuff better, but it sounds like the Java Math module will be required but 
nothing from mrlegacy afaik. So a legacy and ??? build would overlap in the one 
module. We talked about using sbt, but I'm not sure that’s required for a 
release; whatever is easiest. I’d rather make a clean break and use the most 
fit tools since we are all-in with Scala, but I don’t know if there’s any need 
to rush.

+1 to Jira cleanup.

On Mar 5, 2015, at 10:20 AM, Andrew Palumbo <ap....@outlook.com> wrote:


I agree as well with pretty much everything.  Though I'm not sure exactly what 
you mean by a split between mrlegacy and scala. If you're talking about 
complete compartmentalization between these two sides of the project for this 
release, I'm all for it.  I think the only intersection at the moment is in 
spark-shell and spark (there's nothing in math, right?), and they should be 
completely split with Dmitriy's refactoring (he said he was working on this). 

>>3) the release build is completely broken. No artifacts are created for 
>>scala, spark, or h2o. No hosted scaladocs are created afaik.

  right now i think only mrlegacy docs are being published.    

>> 4) Naive Bayes only partial pipeline for text classification is implemented 
>> in Scala but NB itself is working, TF-IDF in progress
>> 4) finish the text pipeline
>>+1, would explore the new text processing features available in Lucene 5. 
>>Please don't go by how MLlib does this

agreed.  also +1 for Lucene for text processing.

I've been looking into this and we talked a bit about implementing a Lucene 
analyzer-based vectorizer for text in the spark module.  I've been thinking 
about trying to work something up that would support both IndexedDatasets and 
SchemaRDDs, but I have to get more familiar with both.

>> 5) There is some distributed aggregation work that is waiting in a PR and 
>> seems to be stalled. I’d vote to see this included.
also +1 

>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than 
>> anything like what Mahout is today.
+1 -- very important. we should really have a template or some standard to make 
the docs easier to follow.

>> 1) more stats and polish to the shell (savable workspaces, etc)
+1 to visualization here also. 

Yes, I also agree to getting a release out.  We do have several MRLegacy bugfixes 
as well.  I counted 45 the other day since the last release, with several in 
the JIRA; I've been meaning to post to the dev list about this.

I think that it would be good for us to get back to using JIRA more regularly 
again with a release coming up also (and to clean out the backlog of won't 
fixes).
