Hi,
I am new to machine learning and would like to ask for your advice on this
application. Basically, I would like to know whether NuPIC is suitable for
it, or whether I should complement it with other tools or methods.

Problem: from web pages of (mostly) product catalogs, I must extract the
product data into a structured form, i.e. a record to be inserted into a
database.

Input: My set of input data (web pages) may contain products or it may
contain totally unrelated content (false positives), but in 80% of cases
it will be product information.

Output:
My web crawler must read the pages, extract the text, and feed pure ASCII
text to the identification engine (my app with NuPIC), which should do
the following:
- Identify text blocks that describe product properties
- Select each property and assign it a name automatically
- Identify the text that is the value of that property
- Create a record in the form of key-value pairs for all properties and
insert it into a database table
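
To make the target of the steps above concrete, here is a minimal sketch of
the last two steps only (pairing a property name with its value and storing
the record). It is a naive pattern-matching baseline, not NuPIC and not the
learned identification I am asking about. The "Name: value" line pattern, the
function names, and the table and column names are all assumptions made up
for illustration.

```python
import re
import sqlite3

# Assumed pattern: a short label, a colon, then the value on the same line.
PAIR_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def extract_pairs(text):
    """Return a list of (name, value) candidates found in plain ASCII text."""
    pairs = []
    for line in text.splitlines():
        match = PAIR_RE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def store_record(conn, record_id, pairs):
    """Insert one record as rows of (record_id, prop_name, prop_value)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product_property "
        "(record_id INTEGER, prop_name TEXT, prop_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO product_property VALUES (?, ?, ?)",
        [(record_id, name, value) for name, value in pairs],
    )
    conn.commit()


# Hypothetical page text, as it might look after the crawler strips the markup.
page_text = """Great phone, buy now!
OS: Android
Color: blue
Manufacturer: Samsung
Price: $99.00
"""

conn = sqlite3.connect(":memory:")
pairs = extract_pairs(page_text)
store_record(conn, 1, pairs)
rows = conn.execute(
    "SELECT prop_name, prop_value FROM product_property "
    "WHERE record_id = 1 ORDER BY rowid"
).fetchall()
```

Storing one row per property, rather than one column per property, means the
table does not need to know the property names in advance.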

To make that easier, I can filter and feed in a lot of data and analyze it
all at once, so that the generalization is more global.
I want this to be error free, so I will discard many of the proposed
key-value pairs that are not a very good match (i.e. the probability that
the pair is a valid key-value generalization is low).

Example:
Let's say I want to find all mobile phones with Android OS. I query Google
with the "android mobile phone" parameters and feed the resulting web pages
to my engine. It must return something like this:
    Key             Value
Record 1:
    OS              Android
    Color           blue
    Manufacturer    Samsung
    Price           $99.00
Record 2:
    OS              Android
    Color           black
    Manufacturer    Huawei
    Price           $249.00
.....
There may be other properties, such as 'OS version', but since that is not
a widely used field, I will discard it as having a low probability of
being a key-value pair.
For testing, I will query search engines and feed in all the pages that
contain certain keywords.
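
The idea of discarding rarely used fields can be sketched as a simple
frequency filter over the extracted records. This is only an illustration
under assumptions: the function name, the 50% threshold, and the third sample
record (including its 'OS version' value and price) are invented for the
example, and raw frequency is just a stand-in for a real probability
estimate.

```python
from collections import Counter


def filter_rare_keys(records, min_support=0.5):
    """Keep only keys that appear in at least min_support of the records."""
    if not records:
        return []
    counts = Counter(key for record in records for key in record)
    threshold = min_support * len(records)
    kept = {key for key, count in counts.items() if count >= threshold}
    return [
        {key: value for key, value in record.items() if key in kept}
        for record in records
    ]


# The first two records come from the example above; the third is made up.
records = [
    {"OS": "Android", "Color": "blue",
     "Manufacturer": "Samsung", "Price": "$99.00"},
    {"OS": "Android", "Color": "black",
     "Manufacturer": "Huawei", "Price": "$249.00"},
    {"OS": "Android", "OS version": "4.4",
     "Manufacturer": "Samsung", "Price": "$120.00"},
]

# 'OS version' appears in only one of three records, so it is dropped.
filtered = filter_rare_keys(records)
```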

Is this an achievable application? My impression is that this kind of
problem is too much for current AI capabilities, but you surely know better
than I do. If this is possible, I may extend the algorithm to auto-complete
missing property values by adding rules like 'if it is a Samsung Galaxy
XYZ, then it has Android OS', and so on.
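
Such auto-completion rules could be sketched as condition/addition pairs
applied to each record. This is a rough illustration only: the 'Model' key,
the rule format, and the function name are assumptions, and the rule never
overwrites a value that was actually extracted.

```python
def apply_rules(record, rules):
    """Return a copy of record with missing values filled in by matching rules."""
    completed = dict(record)
    for condition, additions in rules:
        # A rule fires only if every key in its condition has the given value.
        if all(completed.get(key) == value for key, value in condition.items()):
            for key, value in additions.items():
                # setdefault keeps any value that was already extracted.
                completed.setdefault(key, value)
    return completed


# Hypothetical rule: 'if it is Samsung Galaxy XYZ, then it has Android OS'.
RULES = [
    ({"Model": "Samsung Galaxy XYZ"}, {"OS": "Android"}),
]

record = {"Model": "Samsung Galaxy XYZ", "Price": "$99.00"}
completed = apply_rules(record, RULES)
```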

I would very much appreciate your comments.

Nulik
P.S.
I thought of solving this problem with template parsing, but that is not
what I want; I want to build a really intelligent system.
