Hi Ted, yes that helps. I think it's going to take me a while to get my head around your suggestions, and as I start building a proof of concept I'll probably have quite a few more questions, but this gives me a good starting point.
I will try to give feedback about my experiences too.

Regards,

Dave

On 17 Feb 2010, at 20:22, Ted Dunning wrote:

> I think I understand your question. To make sure, here it is in my terms:
>
> - you have documents with tag tokens in the fid field
>
> - you have a bunch of rules for defining which documents appear where in your hierarchy. These rules are defined as Lucene queries.
>
> - when you get a new document, it is slow to run every one of these queries against the new document.
>
> - you would like to run these queries very quickly in order to update your hierarchy quickly and to provide author feedback. Using ML would be a spiffy way to do this and might provide hints for updating your hierarchy rules.
>
> My first suggestion for you would be to consider building a one-document index for the author feedback situation. Running all of your rules against that index should be pretty darned fast. That doesn't help with some of the other issues and might be hard to do with Solr, but it would be easy with raw Lucene. You should be able to run several thousand rules per second this way.
>
> That doesn't answer the question you asked, though. The answer there is yes. Definitely. There are a number of machine learning approaches that could reverse engineer your rules to give you new rules that could be evaluated very quickly. Some learning techniques and some configurations would likely not give you precise accuracy, but some would likely give you perfect replication. Random forest will probably give you accurate results, as would logistic regression (referred to as SGD in Mahout), especially if you use interaction variables (that depend on the presence of tag combinations). You will probably need to do a topological sort because it is common for hierarchical structures to have rules that exclude a node from a child if it appears in the parent (or vice versa).
> Thus, you would want to evaluate rules in dependency order and augment the document with any category assignments as you go down the rule list.
>
> Operationally, you would need to do some coding, and not all of the pieces you need are fully baked yet. The first step is vectorization of your tag list for many documents. Robin has recently checked in some good code for that, and Drew has a more elaborate document model right behind that. You can also vectorize directly from a Lucene index, which is probably very convenient for you. That gives you training data.
>
> Training the classifiers will take a bit since you need to train pretty much one classifier per category (unless you know that a document can have only one category). That shouldn't be hard, however, and with lots of examples the training should converge to perfect performance pretty quickly. The command-line form for running training is evolving a bit right now, and your feedback would be invaluable.
>
> Deploying the classifiers should not be too difficult, but you would be in slightly new territory there since I don't think that many (any) people have deployed Mahout-trained classifiers in anger just yet.
>
> Does this help?
>
> On Wed, Feb 17, 2010 at 1:23 AM, David Stuart <[email protected]> wrote:
>
>> Hi All,
>>
>> I think this question is appropriate for the Mahout mailing list, but if not, any pointers in the right direction or advice would be welcome.
>>
>> We have a taxonomy-based navigation system where items in the navigation tree are made up of tag-based queries (instead of natural-language words), which are matched against content items tagged in a similar way.
>>
>> So we have a taxonomy tree with queries:
>>
>> Id   Label
>> 001  Fruit        (fid:123 OR fid:675) AND -fid:(324 OR 678) ...
>> 002  Round
>> 003  Apple
>> 004  Orange
>> 006  Star
>> 007  Star fruit
>> ....
>>
>> Content pool:
>>
>> "Interesting article on fruit" -> tagged with (123, 234, 675)
>> "The mighty orange!" -> tagged with (123, 324, 678)
>>
>> Hopefully you get the picture.
>>
>> Now we bake these queries into our Solr index, so instead of running the Fruit query we have pre-done it and just search for items in the index that have id 001. The reasons for doing this are not really important, but we have written an indexer for the purpose. Also, content items are multi-surfacing, so an item could appear at 001, 004 and 007.
>>
>> Although the indexer is OK at doing this pre-bake job, it's not very fast, and as the content and tree grow it gets slower.
>>
>> NOW for the actual question!
>>
>> Is there an ML model that can quickly classify/identify where a new (or retagged) piece of content fits onto the tree? Oh, and the queries on the leaf nodes can change (less often), so a quick process to reclassify what is in scope for that leaf would be useful.
>> The reason I want this is that it would be great to have real-time feedback for an author applying tags to a document about where it fits in the site.
>>
>> Once I get this working I would love to add suggested tags or weighting based on content items with contextual similarity.
>> I think it was Grant that was talking about a Solr external field that could be used to hook this together, or maybe I am mistaken.
>>
>> Hope this makes sense.
>>
>> Thanks for your help/advice in advance.
>>
>> Regards,
>>
>> Dave
>
> --
> Ted Dunning, CTO
> DeepDyve
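To make the rule semantics in the thread concrete: the "Fruit" query from David's example, `(fid:123 OR fid:675) AND -fid:(324 OR 678)`, can be evaluated against a single document's tag set in plain Java. This is only a sketch of what one rule computes — the real system expresses rules as Lucene queries (and Ted's one-document-index suggestion would run them with raw Lucene rather than set operations); the class and method names here are made up for illustration.

```java
import java.util.Set;

public class FruitRule {
    // Sketch of the "Fruit" rule from the thread:
    // (fid:123 OR fid:675) AND -fid:(324 OR 678)
    static boolean matchesFruit(Set<Integer> fids) {
        boolean anyRequired = fids.contains(123) || fids.contains(675);
        boolean anyExcluded = fids.contains(324) || fids.contains(678);
        return anyRequired && !anyExcluded;
    }

    public static void main(String[] args) {
        // "Interesting article on fruit" -> tagged with (123, 234, 675)
        System.out.println(matchesFruit(Set.of(123, 234, 675))); // true
        // "The mighty orange!" -> tagged with (123, 324, 678): 324 excludes it
        System.out.println(matchesFruit(Set.of(123, 324, 678))); // false
    }
}
```

With a few thousand rules of this shape, evaluating them all against one freshly tagged document is cheap, which is what makes the real-time author feedback Dave asked about plausible even before any ML is involved.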
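Ted's point about evaluating rules in dependency order can be sketched with a standard topological sort (Kahn's algorithm). The dependency edges below are hypothetical — the thread doesn't give actual dependencies, so this just assumes that "003 Apple" and "004 Orange" must run after the "001 Fruit" assignment has been made.

```java
import java.util.*;

public class RuleOrder {
    // Kahn's algorithm: order rules so each rule runs after the rules it
    // depends on (e.g. a child rule that excludes documents already assigned
    // by its parent must see the parent's assignment first).
    static List<String> dependencyOrder(Map<String, List<String>> deps) {
        Map<String, Integer> indegree = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (String rule : deps.keySet()) indegree.putIfAbsent(rule, 0);
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            for (String prereq : e.getValue()) {
                indegree.putIfAbsent(prereq, 0);
                indegree.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(prereq, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String rule = ready.remove();
            order.add(rule);
            for (String d : dependents.getOrDefault(rule, List.of()))
                if (indegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        if (order.size() != indegree.size())
            throw new IllegalStateException("cycle in rule dependencies");
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical: Apple (003) and Orange (004) depend on Fruit (001).
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("001", List.of());
        deps.put("003", List.of("001"));
        deps.put("004", List.of("001"));
        System.out.println(dependencyOrder(deps));
    }
}
```

Walking the sorted list and appending each matched category to the document's tag set, as Ted describes, then lets later rules reference earlier assignments.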
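Finally, Ted's "interaction variables (that depend on the presence of tag combinations)" can be illustrated by vectorizing a tag set into a fixed-width feature vector: one feature per tag plus one per unordered tag pair, hashed into a fixed number of slots. This is a plain-Java approximation of the feature-hashing idea, not Mahout's actual vectorization code; the vector width and hashing scheme are arbitrary choices for the sketch.

```java
import java.util.*;

public class TagVectorizer {
    // Arbitrary vector width for this sketch; collisions are tolerated,
    // which is the usual trade-off with hashed feature encoding.
    static final int WIDTH = 1024;

    static double[] vectorize(SortedSet<Integer> tags) {
        double[] v = new double[WIDTH];
        // one feature per tag
        for (int t : tags)
            v[Math.floorMod(Integer.hashCode(t), WIDTH)] += 1.0;
        // one "interaction" feature per unordered tag pair (a < b)
        for (int a : tags)
            for (int b : tags.tailSet(a + 1))
                v[Math.floorMod(Objects.hash(a, b), WIDTH)] += 1.0;
        return v;
    }

    public static void main(String[] args) {
        // "Interesting article on fruit" -> tagged with (123, 234, 675):
        // 3 tag features + 3 pair features = total weight 6.0
        double[] v = vectorize(new TreeSet<>(List.of(123, 234, 675)));
        System.out.println(Arrays.stream(v).sum()); // 6.0
    }
}
```

Pair features are what let a linear model such as SGD logistic regression express AND-like conditions (e.g. "123 together with 675"), which is why Ted suggests them for replicating Boolean rules exactly.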
