Nulik,

To decide whether to use HTM technology for a problem, the first question to ask is "Is this a temporal data problem?" HTM works best with fast-streaming temporal data, much as the brain processes fast streams of sensory input. From reading your email, I don't think you have a temporal problem here. You have a spatial classification problem, very similar to extracting features from images.
I suggest you look into recent deep learning techniques for this type of task. You could also look into Cortical.io's Retina API, which could help you classify blocks of text using generalized semantic fingerprints. See http://cortical.io.

Regards,
---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Sun, Feb 14, 2016 at 4:19 PM, Nulik Nol <[email protected]> wrote:
> Hi,
> I am new to machine learning and I would like to ask you for advice on this
> application. Basically I would like to know if NuPIC is suitable for this, or
> whether I should complement it with some other tools/methods.
>
> Problem: from webpages of (mostly) product catalogs I must extract the
> product data into a structured form, i.e. a record to be inserted into a
> database.
>
> Input: My set of input data (web pages) may contain products or it may
> contain totally different stuff (false positives), but in 80% of cases it
> will be product information.
>
> Output:
> My webcrawler must read the pages, extract the text and feed pure ASCII
> text to the identification engine (my app with NuPIC), which should do
> this:
> - Identify text blocks that describe product properties
> - Select each property and assign it a name automatically
> - Identify the text that is the value of that property
> - Create a record in the form of 'key value' pairs for all properties and
> insert it into a database table
>
> To make that easier, I can filter and feed a lot of data and analyze it all at
> once, so generalization would be more global.
> I want this to be error free, so I will be discarding a lot of proposed
> key-value pairs that are not a very good match (i.e. the probability that
> they are a key-value generalization is low).
>
> Example:
> Let's say I want to find all mobile phones with Android OS. I query Google
> with "android mobile phone" parameters and feed the resulting webpages to my
> engine.
> It must return something like this:
> Key            Value
> Record 1:
> OS             Android
> Color          blue
> Manufacturer   Samsung
> Price          $99.00
> Record 2:
> OS             Android
> Color          black
> Manufacturer   Huawei
> Price          $249.00
> .....
> There may be other properties like 'OS version', but since it is not a widely
> used field, I will be discarding it for having a low probability of being a
> key-value pair.
> For testing I will be using search engines and feeding in all the pages that
> contain certain keywords.
>
> Is this an achievable application? I have a perception that this kind of
> problem is too much for current AI capabilities, but you would know better
> than me for sure. If this is possible, I may complicate the algorithm by
> auto-completing missing property values by adding rules like 'if it is
> Samsung Galaxy XYZ, then it has Android OS', and so on...
>
> I would very much appreciate your comments.
>
> Nulik
> p.s.
> I thought of solving this problem by template parsing, but this is not what
> I want; I want to build a really intelligent system
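The filtering step described in the quoted message (discarding rarely seen properties such as 'OS version') could be sketched roughly as below. This is only an illustration under stated assumptions: it treats "probability of being a key-value pair" as the fraction of records in which a key appears, the function name `filter_common_keys` and the `min_support` parameter are hypothetical, and the 'OS version' value is made up. It does not use NuPIC or any Cortical.io API.

```python
from collections import Counter


def filter_common_keys(records, min_support=0.6):
    """Keep only keys present in at least `min_support` of the records.

    Assumption: key frequency across records stands in for the
    "probability" mentioned in the quoted message.
    """
    if not records:
        return []
    # Count in how many records each key occurs (dict keys are unique per record).
    counts = Counter(key for record in records for key in record)
    threshold = min_support * len(records)
    keep = {key for key, n in counts.items() if n >= threshold}
    return [{k: v for k, v in record.items() if k in keep} for record in records]


# Records taken from the example in the message; the "OS version" value is hypothetical.
records = [
    {"OS": "Android", "Color": "blue", "Manufacturer": "Samsung", "Price": "$99.00"},
    {"OS": "Android", "Color": "black", "Manufacturer": "Huawei", "Price": "$249.00",
     "OS version": "5.1"},
]

filtered = filter_common_keys(records, min_support=0.6)
for record in filtered:
    print(record)
```

With only two records, 'OS version' appears in half of them and falls below the 0.6 cutoff, so it is dropped while the four shared keys survive. A real system would need far more pages before such a frequency is meaningful.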
