Yes, you can use CountVectorizer.
Do you want the different text fields to share the same vocabulary?
To use CountVectorizer, you probably have to either collect all the
values into one corpus (for a shared vocabulary)
or fit one CountVectorizer per key (you could use FeatureUnion to combine them).
So there is a little bit of code to write to handle the fact that you
have multiple text fields.
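
A minimal sketch of both options, in case it helps. The KeySelector class is a hypothetical helper (not part of scikit-learn), the second record is made up for illustration, and I am assuming only the "b" and "d" fields hold free text. An all-empty field such as "c" is left out because CountVectorizer cannot build a vocabulary from it.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class KeySelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pull one text field out of each record dict."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Missing keys become empty strings so every record yields a document.
        return [record.get(self.key, "") for record in X]


# First record is the one from the question; the second is invented.
records = [
    {"a": "3", "b": "random1", "c": "", "d": "random2 text"},
    {"a": "7", "b": "some text", "c": "", "d": "random1"},
]
text_keys = ("b", "d")  # assumption: these are the free-text fields

# Option 1: one CountVectorizer (and one vocabulary) per key,
# with the resulting column blocks stacked side by side.
per_key = FeatureUnion([
    (key, Pipeline([("select", KeySelector(key)), ("bow", CountVectorizer())]))
    for key in text_keys
])
X_per_key = per_key.fit_transform(records)

# Option 2: a single shared vocabulary, built by joining the text
# fields of each record into one document.
shared = CountVectorizer()
X_shared = shared.fit_transform(
    [" ".join(record[key] for key in text_keys) for record in records]
)

print(X_per_key.shape, X_shared.shape)
```

With per-key vocabularies the same word in "b" and in "d" gets two separate columns; with a shared vocabulary it gets one. Which is better depends on whether the field a word appears in matters for your clustering.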
Hth,
Andy
On 06/21/2014 03:35 PM, Abijith Kp wrote:
Hi,
Initially, one of my feature lists looks like: {"a":"3",
"b":"random1", "c":"", "d":"random2 text"}.
The random text contains names of people, email IDs, descriptions,
numbers, and so on.
When I used DictVectorizer, I could not get an accurate clustering.
I wanted to know if there is any method similar to DictVectorizer
that can correctly process a dictionary of string features.
Regards,
Abijith
On Sat, Jun 21, 2014 at 6:51 PM, Andy wrote:
Hi Abijith.
It depends on how you want to interpret the strings.
If they are texts and you want to interpret them based on their
content, Brian's suggestion is the right one.
If you want to consider each possible string as a distinct
feature, the OneHotEncoder would be the right choice.
Could you give an example of what the strings and the semantics of
the strings are?
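
To make the two interpretations concrete, here is a small sketch. The sample values are invented, and it assumes a scikit-learn release whose OneHotEncoder accepts string input directly; if yours does not, the strings would first have to be mapped to integer codes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Invented sample values for a single string field.
values = ["random2 text", "other text", "random2 text"]

# Interpretation 1: every distinct string is its own category.
# OneHotEncoder expects a 2D input, so each value is wrapped in a row.
onehot = OneHotEncoder(handle_unknown="ignore")
X_cat = onehot.fit_transform([[v] for v in values])

# Interpretation 2: the strings are texts, described by the words in them.
bow = CountVectorizer()
X_text = bow.fit_transform(values)

print(X_cat.shape, X_text.shape)
```

In the first case "random2 text" and "other text" have nothing in common; in the second they share the column for "text".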
Andy
On 06/20/2014 06:05 PM, Abijith Kp wrote:
Can anyone help me with the problem of dealing with features that
are both strings of varying length (say from 0 to 100-150
characters) and numbers?
What are the most widely used techniques in this kind of
situation? And can it be solved using only scikit-learn?
PS: Initially I have to convert a JSON file to a feature list,
and then use it.
Any help is appreciated.
Regards,
Abijith
--
Abijith KP
github.com/abijith-kp <http://github.com/abijith-kp>
kpabijith.wordpress.com <http://kpabijith.wordpress.com>
------------------------------------------------------------------------------
HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
Find What Matters Most in Your Big Data with HPCC Systems
Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
Leverages Graph Analysis for Fast Processing & Easy Data Exploration
http://p.sf.net/sfu/hpccsystems
_______________________________________________
Scikit-learn-general mailing list
[email protected]
<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general