Hi everyone:
I'm trying to fit a regressor for price
estimation on a dataset of product ads (think eBay).
Descriptions are in natural language and written
by non-professionals, so they are subject to all the
common mistakes that need to be taken into
consideration when learning from natural language
(e.g., misspelled words, bad grammar). Also,
some of these ads are outliers, in the sense that
they have wrong (super-low or super-high) prices just
to get noticed. I have a very large dataset (10^5 ads)
from a lot of different product categories (e.g., cars,
consumer electronics, real estate).
Ideally, I would like to fit a single regressor for all
categories, but I can also try a two-step
approach (training a classifier first to determine the category
and then using a regressor tuned for
that category). I have to fit offline, and then
make predictions for new ads
online (for a website I'm making).
I have two main considerations:
1) The regressor should not be a black box. I mean,
ideally I would like it to give me a weight
for each feature it found, so that I could explain, for
instance, that this laptop costs $1000 because
the `Core I5` feature costs $300, the `128 GB SSD
hard drive` feature costs $200 and so on (not real
numbers, just an illustration). In some sense, I want feature
selection, with as few features as possible. I'm not
an expert, but I think this calls for L1 regularization?
2) The actual prediction for new items should be as simple as
possible, because at some point I'd like to
run it entirely in the browser (remember, this is
for a commercial product), and thus implement
it in JavaScript.
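To make 1) and 2) a bit more concrete, here is a minimal sketch of what I
have in mind, assuming a linear model over binary "feature present"
indicators. The helper names (`explain_price`, `export_model`), the
intercept and all the numbers are made up for illustration; nothing here
comes from a fitted model or from scikit-learn itself.

```python
import json


def explain_price(features, weights, intercept):
    """Return the estimated price and the per-feature contributions.

    Assumes a linear model over binary features: a feature present in the
    ad contributes its weight, an absent or zero-weight feature nothing.
    """
    contributions = {
        f: weights[f] for f in features if weights.get(f, 0.0) != 0.0
    }
    price = intercept + sum(contributions.values())
    return price, contributions


def export_model(weights, intercept, tol=1e-9):
    """Serialize only the nonzero weights, so a browser client just sums them."""
    sparse = {name: w for name, w in weights.items() if abs(w) > tol}
    return json.dumps({"intercept": intercept, "weights": sparse}, sort_keys=True)


# Hypothetical weights in dollars (not real numbers). "blue" stands for a
# feature whose weight was driven to zero by the regularization.
weights = {"Core I5": 300.0, "128 GB SSD hard drive": 200.0, "blue": 0.0}
price, parts = explain_price(
    ["Core I5", "128 GB SSD hard drive", "blue"], weights, intercept=500.0
)
payload = export_model(weights, intercept=500.0)
```

With a model of this shape, the JavaScript side would only need to parse
the JSON and add up the weights of the features found in the ad.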
It doesn't need to be very precise (again, this is for a commercial
product, not actual research); a rough estimate
within 10% or 20% is something I think
I can live with. However, I would like to
know an estimate of the error.
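By "an estimate of the error" I mean something like the relative error
measured on ads the model has not seen during fitting. A rough sketch,
assuming I hold out some ads and have their true and predicted prices as
plain lists (the function names and the numbers are illustrative only):

```python
from statistics import median


def relative_errors(true_prices, predicted_prices):
    """Absolute error of each prediction as a fraction of the true price."""
    return [abs(p - t) / t for t, p in zip(true_prices, predicted_prices)]


def error_summary(true_prices, predicted_prices, tolerance=0.2):
    """Median relative error and share of predictions within `tolerance`."""
    errs = relative_errors(true_prices, predicted_prices)
    within = sum(e <= tolerance for e in errs) / len(errs)
    return {
        "median_relative_error": median(errs),
        "fraction_within_tolerance": within,
    }


# Made-up held-out prices and predictions, just to exercise the functions.
summary = error_summary(
    [1000.0, 200.0, 50.0, 400.0],
    [900.0, 260.0, 55.0, 400.0],
)
```

I used the median rather than the mean here because a few ads with
absurd prices would otherwise dominate the figure.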
I've been looking around and thought of using Stochastic
Gradient Descent, because it seems to fit my problem
quite well. I just wanted to ask for general directions,
and whether there are issues I might run
into that I haven't thought of (there surely must be plenty).
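For reference, this is roughly how I understand SGD with an L1 penalty,
written as a plain-Python sketch: a gradient step on the squared loss
followed by a soft-thresholding step that pulls weights toward zero.
This is my own simplification on toy data, not scikit-learn's
implementation, and `alpha`, `lr` and `epochs` are arbitrary values I
picked for the example.

```python
import itertools
import random


def soft_threshold(value, amount):
    """Shrink value toward zero; anything within `amount` becomes exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_sgd_l1(rows, targets, alpha=0.01, lr=0.05, epochs=200, seed=0):
    """Fit a linear model with squared loss and an L1 penalty by SGD."""
    weights = [0.0] * len(rows[0])
    intercept = 0.0
    order = list(range(len(rows)))
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new order each epoch
        for i in order:
            x, y = rows[i], targets[i]
            residual = intercept + sum(w * v for w, v in zip(weights, x)) - y
            intercept -= lr * residual  # the intercept is not penalized
            weights = [
                soft_threshold(w - lr * residual * v, lr * alpha)
                for w, v in zip(weights, x)
            ]
    return weights, intercept


# Toy data: every combination of three binary features. The third feature
# has no effect on the target, so its weight should end up near zero.
rows = [list(bits) for bits in itertools.product([0.0, 1.0], repeat=3)]
targets = [5.0 + 3.0 * a + 2.0 * b for a, b, _ in rows]
weights, intercept = fit_sgd_l1(rows, targets)
```

If I understand correctly, the penalty also biases the useful weights
slightly toward zero, so the recovered values land a little below 3 and 2.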
Best regards and thanks in advance.
Alejandro Piad.
University of Havana, Cuba.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general