I've also had success using Apache Tika (via IKVM) to do text extraction.

This GitHub repo has example code and tests for pulling the contents out of
PDFs, Word documents, etc.

https://github.com/KevM/tikaondotnet

It works great for me in a product I helped create. Once you have the text of
the document, you index it as you would any other content.
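
To make that last step concrete, here is a toy sketch in Python of "extract
the text, push it into an index, query it back out". It is not Tika's or
Lucene.Net's API: the TinyIndex class, its add/search methods, and the sample
file names and text are all made up for illustration, and the regex tokenizer
is only a crude stand-in for a real analyzer.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index (hypothetical; not Lucene's API)."""

    def __init__(self):
        # term -> set of document ids that contain the term
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Lowercase and split on non-alphanumerics, a crude stand-in
        # for what a real analyzer does.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[term].add(doc_id)

    def search(self, query):
        # Return the ids of documents that contain every query term.
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


# Plain text as it might come out of an extractor (sample data, made up).
index = TinyIndex()
index.add("report.pdf", "Quarterly sales report for the northern region")
index.add("memo.doc", "Memo: sales meeting moved to Friday")

hits = index.search("sales report")
```

The point is only the shape of the flow: the index never sees the PDF or Word
file itself, just the plain text you hand it along with an id you can map back
to the original document.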

Kevin Miller
<https://github.com/KevM/tikaondotnet>

On Wed, Feb 2, 2011 at 1:10 AM, Prescott Nasser <[email protected]> wrote:

>
> Just to add: since you're likely on a Windows platform, check out IFilters
> and how to use them. They are probably the easiest way you have to extract
> data from PDF/HTML/XML.
>
> Check this out for getting started with the IFilter interface:
> http://www.codeproject.com/KB/cs/IFilter.aspx?msg=2428047
>
> Once you have extracted the plain text, Lucene comes in to parse that text
> and create an index.
>
> ~P
>
> > From: [email protected]
> > Date: Wed, 2 Feb 2011 12:09:01 +1100
> > Subject: Re: Question
> > To: [email protected]
> >
> > Lucene.Net uses the same binary data store that Lucene uses, which is
> > stored on the file system (generally; it depends on what Directory
> > instance you provide to the indexer & searcher).
> >
> > Some projects, such as NHibernate.Search and RavenDB, use Lucene.Net
> > internally and handle synchronizing the data stores (DB & Lucene).
> > If you're trying to index things such as HTML, XML, PDF, etc., you have to
> > write your own way to read the data into Lucene, though.
> > Aaron Powell
> > Umbraco Core Team Member <http://umbraco.codeplex.com> | FunnelWeb Team
> > Member <http://funnelweblog.com>
> >
> > http://www.aaron-powell.com | http://twitter.com/slace | Skype:
> > aaron.l.powell | MSN: [email protected]
> >
> >
> > On Wed, Feb 2, 2011 at 12:03 PM, Lucas E Wall <[email protected]> wrote:
> >
> > >
> > > Thanks, Aaron. I went through your blog and it makes a lot of sense.
> > > Given that Lucene is ASP-friendly, can I call the library from MSSQL?
> > > Where does the index get stored? Do I need to provide a database for the
> > > files I need indexed, and for the index as well? Maybe my questions are
> > > a little bit too entry-level.
> > >
> > > > From: [email protected]
> > > > Date: Wed, 2 Feb 2011 11:04:45 +1100
> > > > Subject: Re: Question
> > > > To: [email protected]
> > > >
> > > > You don't actually install Lucene.Net; it's just a library which you
> > > > reference in your application. Solr is an installable Lucene service,
> > > > which essentially provides RESTful endpoints to Lucene (Java), or so
> > > > goes my understanding.
> > > >
> > > > With regard to what you can search with Lucene, well, that really
> > > > comes down to anything you can push into the index. Keep in mind that
> > > > Lucene is just an indexer and searcher; it's not a crawler or anything.
> > > > You have to push the data to the indexer, and you have to write queries
> > > > to get it back out.
> > > > I've got some blog posts on my site about getting started with
> > > > Lucene.Net - http://www.aaron-powell.com/lucene-net-overview
> > > > Aaron Powell
> > > > Umbraco Core Team Member <http://umbraco.codeplex.com> | FunnelWeb
> > > > Team Member <http://funnelweblog.com>
> > > >
> > > > http://www.aaron-powell.com | http://twitter.com/slace | Skype:
> > > > aaron.l.powell | MSN: [email protected]
> > > >
> > > >
> > > > On Wed, Feb 2, 2011 at 10:57 AM, Lucas E Wall <[email protected]> wrote:
> > > >
> > > > >
> > > > > I am new to Lucene and have the following questions. What is the
> > > > > best way to understand what is required to install Lucene on a
> > > > > server? Also, can I make Lucene run searches on links to XML data on
> > > > > the web? Thanks
> > > > >
> > >
> > >
>
