Re: [Dspace-tech] More re Porter Stem Filter

Tim Donohue Wed, 02 Feb 2011 09:53:33 -0800

Richard,

Quick Answer: Your suggested change should not break anything; it simply 
removes the stemming step. You would need to rebuild DSpace and then 
reindex all of your content, since the existing index still holds 
stemmed terms.
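
To illustrate why the reindex matters, here is a rough sketch in plain 
Java. It does not use the Lucene or DSpace classes: the stop list is made 
up, and toyStem() is only a stand-in that strips a trailing "ness", not 
the real Porter algorithm. It just shows that an index built with 
stemming no longer matches a query once stemming is switched off.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StemmingDemo {
    // Assumed stop list for illustration; not DSpace's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Toy stand-in for PorterStemFilter: strips only a trailing "ness".
    static String toyStem(String term) {
        return term.endsWith("ness") && term.length() > 5
                ? term.substring(0, term.length() - 4)
                : term;
    }

    // Split on non-alphanumerics, lowercase, drop stop words, optionally stem.
    static List<String> analyze(String text, boolean stem) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (token.isEmpty()) continue;
            String term = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) continue;
            terms.add(stem ? toyStem(term) : term);
        }
        return terms;
    }

    public static void main(String[] args) {
        String doc = "The Wellness of Students";
        Set<String> oldIndex = new HashSet<>(analyze(doc, true));  // built with stemming
        Set<String> newIndex = new HashSet<>(analyze(doc, false)); // rebuilt without it
        String query = analyze("wellness", false).get(0);          // query side no longer stems
        System.out.println("old index matches: " + oldIndex.contains(query)); // false
        System.out.println("new index matches: " + newIndex.contains(query)); // true
    }
}
```

The old index only contains "well", so the unstemmed query term 
"wellness" finds nothing until the content is reindexed.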


As for the philosophical discussion: we are always willing to rethink 
past decisions. :)

Stemming has been utilized in DSpace since the very beginning. But, that 
doesn't mean we need to continue to do so, if there's a good reason not 
to. I'd be interested to hear the views of others in the community. We 
could turn off stemming altogether, or offer an additional 
'DSNonStemmingAnalyzer' option if there was enough interest.  If you or 
others are interested in seeing that happen, I'd encourage you to 
suggest it as an improvement in our Issue Tracker
(https://jira.duraspace.org/browse/DS) and/or continue discussion on 
this list.

Also, for more on how to make suggestions for DSpace improvements / new 
features, see https://wiki.duraspace.org/display/DSPACE/HowToContribute

Please feel free to continue the discussion on this listserv thread. 
I'm interested to hear what others think, as stemming has both pros 
and cons.

- Tim

On 2/2/2011 11:35 AM, Jizba, Richard wrote:
> Tim,
>
> Thanks for the response.
> The first option isn't an option for us because we need to search for
> numbers.
> I did see something like the second option, which said to basically
> comment out the PorterStemFilter.
> So my question is, can I eliminate that level of stemming altogether?
> This is what I want to do:
>
>   ===========================================
>      public final TokenStream tokenStream(String fieldName, final Reader reader)
>      {
>          TokenStream result = new DSTokenizer(reader);
>          result = new StandardFilter(result);
>          result = new LowerCaseFilter(result);
>          result = new StopFilter(result, stopSet);
>          // result = new PorterStemFilter(result);  // stemming disabled
>          return result;
>      }
>   ============================================
>
> Will this 'break' anything?
>
> As I understand it, DSpace will then use the DSAnalyzer to parse the
> character data into words, convert them to lower case, and index the
> terms, excluding those on the stop list.
>
> ==== Philosophical Discussion ====
>
> I am a little surprised that the DSpace community thinks stemming like
> that done by the Porter Stemming Algorithm is so important. I have been
> searching bibliographic databases since the early 1980s and teach
> courses to our health sciences students on search techniques. We have
> always appreciated the systems that give us the power to find exactly
> the terms and the combinations we want. Language is just too rich and
> varied for any other approach in my experience. There have been many
> times when I have needed to search for a singular form of a noun vs a
> plural form or vice versa. Using truncation and wildcard operators is
> not rocket science. Lucene has some really powerful search operators,
> but their power is basically nullified by the Stemming operation.
>
> Our DSpace instance isn't aimed primarily at a broad worldwide user
> base, but select groups of students, staff and faculty with rather
> sophisticated information needs. Besides, most of our collection can
> also be discovered through Google. Why duplicate that, when I have the
> option of also creating an alternative search environment that provides
> for sophisticated, analytical searches of scholarly, curricular and
> administrative documents?
>
> You might be surprised at how quickly the people in our Office of
> Medical Education have picked up on the nuances of how and where they
> put metadata, the need for standardized vocabulary in defining lecture
> objectives, and how quickly they figured out what was happening to their
> attempts to search for "wellness" (stemmed to "well"). (It did not
> surprise me!)
>
> I think the distributed community administration available with DSpace
> will really help our faculty and staff take seriously the data (text)
> they put into their collections. Our expertise as "consultants" and
> trainers to the staff in the Office of Medical Education has really made
> them appreciate the expertise of librarians, particularly my reference
> librarians who have very good analytical search skills. Don't sell
> people short -- they can be very sophisticated, which means we need to
> provide them with powerful tools, not heavy-handed interventions (the
> Porter Algorithm).
>
> I'm planning on being at OR11 and would be happy to discuss this over a
> beer.
>
> If anybody is still with me, I would be curious if there is a
> LowerCaseFilter that would permit the retention of capital 'A's.
> Eliminating 'A's in medical research databases is a problem. Vitamin A
> is the obvious example, but there are many other occurrences of 'A' as
> an important, non-trivial term in a name.
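
One possible approach to the capital 'A' question, sketched in plain Java 
rather than against the Lucene API. It assumes "a" is on the stop list 
and that the stop check is case-sensitive, so a token left as "A" slips 
past it. The class name, the stop list and the KEEP_AS_IS set are all 
made up for illustration; a real version would have to be written as a 
custom filter in the analyzer chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class KeepCapitalA {
    // Assumed stop list for illustration; not DSpace's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Tokens that must keep their original case (hypothetical set).
    static final Set<String> KEEP_AS_IS = Collections.singleton("A");

    // Lowercase everything except the protected tokens.
    static String normalize(String token) {
        return KEEP_AS_IS.contains(token) ? token : token.toLowerCase(Locale.ROOT);
    }

    // Split on non-alphanumerics, normalize case, then apply a
    // case-sensitive stop-word check so "A" survives while "a" is dropped.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (token.isEmpty()) continue;
            String term = normalize(token);
            if (!STOP_WORDS.contains(term)) terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Vitamin A and a placebo")); // [vitamin, A, placebo]
    }
}
```

Two caveats with this sketch: a sentence-initial article ("A study 
of...") would also be kept as "A", and queries would need the same 
normalization so that a search for "Vitamin A" looks up "A" and not "a".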
>
> Richard Jizba
> Creighton University
>
>
> ------------------------------------------------------------------------------
> Special Offer-- Download ArcSight Logger for FREE (a $49 USD value)!
> Finally, a world-class log management solution at an even better price-free!
> Download using promo code Free_Logger_4_Dev2Dev. Offer expires
> February 28th, so secure your free ArcSight Logger TODAY!
> http://p.sf.net/sfu/arcsight-sfd2d
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

