I beat on the second one. I need to pack and catch a train, else I would snag the other as well.
One thing I did notice on 1163 is that there are some spots where you could use Guava to advantage. For instance, one should pretty much never use Charset.defaultCharset() for machine-readable files. Use Charsets.UTF_8 instead.
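A minimal sketch of the point, using Guava's Files helper (the input file name is invented for illustration):

```java
import com.google.common.base.Charsets;
import com.google.common.io.Files;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class ReadUtf8Example {
  public static void main(String[] args) throws IOException {
    // Hypothetical machine-generated file; the name is made up for this example.
    File input = new File("points.csv");

    // Avoid: Files.readLines(input, Charset.defaultCharset());
    // the result would depend on the platform encoding of whatever JVM runs this.

    // Prefer an explicit charset for machine-readable files:
    List<String> lines = Files.readLines(input, Charsets.UTF_8);
    for (String line : lines) {
      System.out.println(line);
    }
  }
}
```
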
On Wed, Mar 27, 2013 at 7:47 AM, Marty Kube <[email protected]> wrote:

> Hey Ted,
>
> Here are the JIRA tickets...
>
> https://issues.apache.org/jira/browse/MAHOUT-1163
> https://issues.apache.org/jira/browse/MAHOUT-1164
>
> On 03/27/2013 12:37 AM, Ted Dunning wrote:
>
>> Can you post a list of those patches?
>>
>> I haven't been tracking carefully, and unless I have a moment when the email comes through (<10% chance lately) I lose track.
>>
>> On Wed, Mar 27, 2013 at 7:30 AM, Marty Kube <[email protected]> wrote:
>>
>>> So I'd like to continue to improve the RF classifier code. I've been posting patches along the lines of the refactoring discussed here. The patches are not being looked at. Someone should be considering patches in this area. Maybe I could handle that :-)
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 27, 2013, at 12:14 AM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> Totally agree on that. The impact of making Mahout more usable is much higher than that of adding a new algorithm.
>>>>
>>>> On 27.03.2013 05:41, Ted Dunning wrote:
>>>>
>>>>> It is critically important.
>>>>>
>>>>> On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <martykube@beavercreekconsulting.com> wrote:
>>>>>
>>>>>> IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data format and command line option support. It's not glamorous but it's important.
>>>>>>
>>>>>> On 3/26/2013 8:26 PM, Ted Dunning wrote:
>>>>>>
>>>>>>> Gokhan,
>>>>>>>
>>>>>>> I think that the general drift of your recommendation is an excellent suggestion, and it is something that we have wrestled with a lot over time. The recommendations side of the house has more coherence in this matter than other parts, largely because there was a clear flow early on.
>>>>>>>
>>>>>>> Now, however, the flow is becoming more clear for the non-recommendation parts of the system.
>>>>>>>
>>>>>>> - we have 2-3 external kinds of input. These include text and matrices. Text comes in two major forms, those being text in files with unspecified separators and text in Lucene/Solr indexes. Matrices come in several forms including triples, CSV files, binary matrices and sequence files of vectors.
>>>>>>>
>>>>>>> - there are currently only a few ways to convert text and external data to matrices. The two most prominent are dictionary-based and hashed encoding. Hashed encoding is currently not as invertible as it should be. Dictionary-based encoding has the virtue of being invertible, but hashed encoding has considerably more generality. We have almost no support for multiple fields in dictionary-based encoding.
>>>>>>>
>>>>>>> - good conversion backwards and forwards depends on having schema information that we don't retain or specify well.
>>>>>>>
>>>>>>> - knowledge discovery pathways need more flexibility than recommendation pathways regarding input and visualization.
>>>>>>>
>>>>>>> - key knowledge discovery pathways that I know about include (a) input summarization, (b) vectorization, (c) unsupervised analysis such as LDA, LLR, clustering, SVD, (d) supervised training such as SGD, Naive Bayes and random forest, and (e) visualization of results.
>>>>>>>
>>>>>>> I see that the major problems in Mahout are what Gokhan said, but with a few extras:
>>>>>>>
>>>>>>> 1) as Gokhan said, the exploratory pathways are inconsistent
>>>>>>>
>>>>>>> 2) I think that our visualization pathways are also hideous
>>>>>>>
>>>>>>> 3) I think that we need a good document format with a reasonable schema. Rather than create such a thing, I would nominate Lucene/Solr indexes as a first-class object in Mahout.
>>>>>>>
>>>>>>> 4) our current command lines, with all the (many) different options and incompatible conventions, are a bit of a shambles
>>>>>>>
>>>>>>> Expressed this way, I think that these usability issues are fixable.
>>>>>>>
>>>>>>> What does everybody else think? Would this leave us with a significantly better system?
>>>>>>>
>>>>>>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am moving my email that I wrote to Call to Action upon request. I'll start with an example that I run into when I use Mahout, and list my humble suggestions.
>>>>>>>>
>>>>>>>> When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow:
>>>>>>>>
>>>>>>>> 1- First I use seq2sparse to convert text to vectors. The output is Text, VectorWritable pairs. (If I have a csv data file of id, text lines, which is an understandable format, I need to develop my own tool to convert it to vectors.)
>>>>>>>>
>>>>>>>> 2- I run LDA on the data I transformed, but it doesn't work, because LDA needs IntWritable, VectorWritable pairs.
>>>>>>>>
>>>>>>>> 3- I convert the Text keys to IntWritable ones with a custom tool.
>>>>>>>>
>>>>>>>> 4- Then I run LDA, and to see the results I need to run vectordump with the sort flag (it usually throws OutOfMemoryError). An ldadump tool does not exist. What I see is fairly different from clusterdump results, so I spend some time understanding what it means. (And I need to know that a vectordump tool exists at all in order to see the results.)
>>>>>>>>
>>>>>>>> 5- After running LDA, when I have a document that I want to assign to a topic, there is no way, or at least none I am aware of, to use my learned LDA model to assign this document to a topic.
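A rough sketch of the kind of custom glue tool that step 3 above describes; this is not the tool actually used, the class name and paths are invented, and it assumes a single seq2sparse output file that can be re-keyed in one sequential pass:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

/** Re-keys (Text, VectorWritable) seq2sparse output as (IntWritable, VectorWritable) for LDA. */
public class RekeyVectorsForLda {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. a tfidf-vectors part file from seq2sparse
    Path out = new Path(args[1]);  // destination for the re-keyed vectors

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);
    try {
      Text docName = new Text();
      VectorWritable vector = new VectorWritable();
      IntWritable docId = new IntWritable();
      int next = 0;
      while (reader.next(docName, vector)) {
        // The original document name is dropped here; a real tool would also
        // persist the name-to-id mapping so the results stay interpretable.
        docId.set(next++);
        writer.append(docId, vector);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}
```
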
>>>>>>>> I can give further examples, but I believe this will make my point clear.
>>>>>>>>
>>>>>>>> Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting it?
>>>>>>>>
>>>>>>>> IMO the Knowledge Discovery process has a certain path, and Mahout can define rules that would constrain developers and guide users. For example:
>>>>>>>>
>>>>>>>> - All algorithms take Mahout matrices as input and output.
>>>>>>>> - All preprocessing tools should be generic enough that they produce appropriate input for Mahout algorithms.
>>>>>>>> - All algorithms should output a model that users can use beyond training and testing.
>>>>>>>> - Tools that dump results should follow a strictly defined format suggested by the community.
>>>>>>>> - All similar kinds of algorithms should use the same evaluation tools.
>>>>>>>> - ...
>>>>>>>>
>>>>>>>> There may be separate layers: a preprocessing layer, an algorithms layer, an evaluation layer, and so on.
>>>>>>>>
>>>>>>>> This way users would be aware of the steps they need to perform, and any one step could be replaced by an alternative.
>>>>>>>>
>>>>>>>> Developers would contribute to the layer they feel comfortable with, and would satisfy the expected input and output to preserve the integrity of the whole.
>>>>>>>>
>>>>>>>> Mahout has tools for nearly all of these layers, but personally, when I use Mahout (and I've been using it for a long time), I feel lost in the steps I should follow.
>>>>>>>>
>>>>>>>> Moreover, the refactoring could eliminate duplicate data structures and stick to Mahout matrices where available. All similarity measures operate on Mahout Vectors, for example.
>>>>>>>>
>>>>>>>> We, in the lab and in our company, do some of that. An example:
>>>>>>>>
>>>>>>>> We implemented an HBase-backed Mahout Matrix, which we use for projects where online learning algorithms operate on large input and learn a big parameter matrix (one needs this for matrix factorization based recommenders). The persistent parameter matrix then becomes an input for the live system. We also used the same matrix implementation as the underlying data store of Recommender DataModels. This was advantageous in many ways:
>>>>>>>>
>>>>>>>> - Everyone knows that any dataset should be in Mahout matrix format, and applies appropriate preprocessing, or writes it
>>>>>>>> - We can use different recommenders interchangeably
>>>>>>>> - Any optimization on matrix operations applies everywhere
>>>>>>>> - Different people can work on different parts (evaluation, model optimization, recommender algorithms) without bothering others
>>>>>>>>
>>>>>>>> Apart from all this, I should say that I am always eager to contribute to Mahout, as some of the committers already know.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> Gokhan
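As a small illustration of the point above that similarity measures all operating on the shared Mahout Vector type keep components interchangeable, here is a minimal sketch; the vectors are toy values, and the class names are the ones I believe shipped in the mahout-math and mahout-core jars of this era:

```java
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class VectorSimilarityExample {
  public static void main(String[] args) {
    // Two toy documents already vectorized into the shared Mahout Vector type.
    Vector doc1 = new DenseVector(new double[] {1.0, 0.0, 2.0, 3.0});
    Vector doc2 = new DenseVector(new double[] {0.0, 1.0, 2.0, 1.0});

    // Any DistanceMeasure can be swapped in, since they all accept Vector.
    DistanceMeasure cosine = new CosineDistanceMeasure();
    DistanceMeasure euclidean = new EuclideanDistanceMeasure();
    System.out.println("cosine distance    = " + cosine.distance(doc1, doc2));
    System.out.println("euclidean distance = " + euclidean.distance(doc1, doc2));

    // The same vectors can flow on into clustering or classification unchanged.
    System.out.println("dot product        = " + doc1.dot(doc2));
  }
}
```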
