Hi,
I am new to machine learning and would like to ask for your advice on this
application. Basically, I would like to know whether Nupic is suitable for it,
or whether I should complement it with other tools or methods.
Problem: from web pages of (mostly) product catalogs, I must extract the
product data into a structured form, i.e. a record to be inserted into a
database.
Input: My set of input data (web pages) may contain products or it may
contain totally different stuff (false positives), but in 80% of cases it
will be product information.
Output:
My web crawler must read the pages, extract the text, and feed pure ASCII
text to the identification engine (my app with Nupic), which should do
the following:
- Identify text blocks that describe product properties
- Select each property and assign it a name automatically
- Identify the text that is the value of that property
- Create a record in the form of 'key value' pairs for all properties and
insert it into a database table
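
As a rough illustration of the last step only (building 'key value' records and inserting them into a table), here is a minimal Python sketch. It is not Nupic code: the regular expression that matches "Key: Value" lines is a naive stand-in for the identification engine, and the function names, table name, column names, and sample page text are all hypothetical.

```python
import re
import sqlite3

# Naive stand-in for the identification engine (an assumption, not Nupic):
# treat every "Key: Value" line as one property.
PAIR_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def extract_pairs(text):
    """Return a list of (key, value) tuples found in plain ASCII text."""
    pairs = []
    for line in text.splitlines():
        match = PAIR_RE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def store_record(conn, record_id, pairs):
    """Insert one record's key-value pairs into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS properties "
        "(record_id INTEGER, prop_key TEXT, prop_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO properties (record_id, prop_key, prop_value) "
        "VALUES (?, ?, ?)",
        [(record_id, key, value) for key, value in pairs],
    )
    conn.commit()


# Made-up page text, as it might arrive from the crawler.
page_text = "Samsung phone on sale\nOS: Android\nColor: blue\nPrice: $99.00\n"

conn = sqlite3.connect(":memory:")
pairs = extract_pairs(page_text)
store_record(conn, 1, pairs)
rows = conn.execute(
    "SELECT prop_key, prop_value FROM properties ORDER BY rowid"
).fetchall()
```

The learned part of the system would replace `extract_pairs`; the storage step would stay the same whatever produces the pairs.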
To make that easier, I can filter and feed in a lot of data and analyze it
all at once, so that the generalization is more global.
I want this to be error free, so I will discard many of the proposed
key-value pairs that do not match well (i.e. those with a low probability
of being a valid key-value generalization).
Example:
Let's say I want to find all mobile phones with Android OS. I query Google
with the search terms "android mobile phone" and feed the resulting web pages
to my engine. It must return something like this:
Key            Value
Record 1:
OS             Android
Color          blue
Manufacturer   Samsung
Price          $99.00
Record 2:
OS             Android
Color          black
Manufacturer   Huawei
Price          $249.00
.....
There may be other properties like 'OS version', but since it is not a
widely used field, I will discard it for having a low probability of
being a key-value pair.
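
One crude way to approximate this kind of filtering is to measure how often each key appears across all records and drop the rare ones. The Python sketch below assumes that frequency is an acceptable proxy for the probability mentioned above. The function name, the 0.5 threshold, and the third sample record (including the 'OS version' value) are made up for illustration.

```python
from collections import Counter


def filter_rare_keys(records, min_support=0.5):
    """Drop keys that appear in fewer than min_support of the records.

    records is a list of dicts (one dict per product record).
    """
    total = len(records)
    if total == 0:
        return []
    # Count in how many records each key occurs.
    counts = Counter(key for record in records for key in record)
    kept = {key for key, count in counts.items() if count / total >= min_support}
    return [
        {key: value for key, value in record.items() if key in kept}
        for record in records
    ]


# Sample data: the first two records follow the example above;
# the third record and the 'OS version' value are invented.
records = [
    {"OS": "Android", "Color": "blue", "Manufacturer": "Samsung",
     "Price": "$99.00", "OS version": "4.4"},
    {"OS": "Android", "Color": "black", "Manufacturer": "Huawei",
     "Price": "$249.00"},
    {"OS": "Android", "Color": "white", "Manufacturer": "LG",
     "Price": "$149.00"},
]

filtered = filter_rare_keys(records)
```

Here 'OS version' occurs in only one of three records, so it falls below the threshold and is removed, while the four common keys survive in every record.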
For testing, I will use search engines and feed in all the pages that
contain certain keywords.
Is this an achievable application? I have the impression that this kind of
problem is too much for current AI capabilities, but you surely know better
than I do. If this is possible, I may complicate the algorithm by
auto-completing missing property values with rules like 'if it is
Samsung Galaxy XYZ, then it has Android OS', and so on...
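
Such auto-completion rules could be expressed as plain data, as in the Python sketch below. This is only an assumed design: the 'Model' key, the rule format, and the function name are hypothetical, and a value that was actually extracted from the page is never overwritten by a rule.

```python
# Each rule: (conditions that must all hold, properties implied by them).
# The "Model" key is a hypothetical property name used for illustration.
RULES = [
    ({"Manufacturer": "Samsung", "Model": "Galaxy XYZ"}, {"OS": "Android"}),
]


def complete_record(record, rules=RULES):
    """Return a copy of record with missing values filled in by the rules."""
    completed = dict(record)
    for conditions, implied in rules:
        if all(completed.get(key) == value for key, value in conditions.items()):
            for key, value in implied.items():
                # Only fill gaps; keep anything already extracted.
                completed.setdefault(key, value)
    return completed


partial = {"Manufacturer": "Samsung", "Model": "Galaxy XYZ", "Color": "blue"}
completed = complete_record(partial)
```

The input record is left unchanged, so the original extraction and the rule-completed version can still be compared afterwards.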
I would very much appreciate your comments.
Nulik
p.s.
I thought of solving this problem by template parsing, but that is not what
I want; I want to build a really intelligent system.