I've also had success using Tika to do text extraction (via IKVM). This GitHub repo has example code and tests for pulling content out of PDFs, Word documents, etc.
https://github.com/KevM/tikaondotnet

Works great for me in a product I helped create. Once you have the text of the document, you index it as you would normal content.

Kevin Miller
<https://github.com/KevM/tikaondotnet>

On Wed, Feb 2, 2011 at 1:10 AM, Prescott Nasser <[email protected]> wrote:

> Just to add: since you're likely on a Windows platform, check out IFilters
> and how to use them. They are probably the easiest way you have to extract
> data from PDF/HTML/XML.
>
> Check out this for getting started with the IFilter interface:
> http://www.codeproject.com/KB/cs/IFilter.aspx?msg=2428047
>
> Once you extract the plain text, that is where Lucene comes in to parse
> that plain text and create an index.
>
> ~P
>
> > From: [email protected]
> > Date: Wed, 2 Feb 2011 12:09:01 +1100
> > Subject: Re: Question
> > To: [email protected]
> >
> > Lucene.Net uses the same binary data store that Lucene uses, which is
> > generally stored on the file system (it depends on what Directory
> > instance you provide to the indexer & searcher).
> >
> > Some projects, such as NHibernate.Search and RavenDB, use Lucene.Net
> > internally and handle synchronizing the data stores (DB & Lucene).
> >
> > If you're trying to index things such as HTML/XML/PDF/etc., you have to
> > write your own way to read the data into Lucene though.
> >
> > Aaron Powell
> > Umbraco Core Team Member <http://umbraco.codeplex.com> | FunnelWeb Team
> > Member <http://funnelweblog.com>
> >
> > http://www.aaron-powell.com | http://twitter.com/slace | Skype:
> > aaron.l.powell | MSN: [email protected]
> >
> > On Wed, Feb 2, 2011 at 12:03 PM, Lucas E Wall <[email protected]> wrote:
> >
> > > Thanks, Aaron. I went through your blog and it makes a lot of sense.
> > > Given that Lucene is ASP friendly, can I call the library from MSSQL?
> > > Where does the index get stored? Do I need to provide a database for
> > > the files I need indexed, and for the index as well? Maybe my
> > > questions are a little bit too entry level.
> > >
> > > > From: [email protected]
> > > > Date: Wed, 2 Feb 2011 11:04:45 +1100
> > > > Subject: Re: Question
> > > > To: [email protected]
> > > >
> > > > You don't actually install Lucene.Net; it's just a library which you
> > > > reference in your application. Solr is an installable Lucene service,
> > > > which essentially provides RESTful endpoints to Lucene (Java), or so
> > > > goes my understanding.
> > > >
> > > > With regards to what you can search with Lucene, well, that really
> > > > comes down to anything you can push into the index. Keep in mind that
> > > > Lucene is just an indexer and searcher; it's not a crawler or
> > > > anything. You have to push the data to the indexer, and you have to
> > > > write queries to get it back out.
> > > >
> > > > I've got some blogs on my site about getting started with Lucene.Net:
> > > > http://www.aaron-powell.com/lucene-net-overview
> > > >
> > > > Aaron Powell
> > > >
> > > > On Wed, Feb 2, 2011 at 10:57 AM, Lucas E Wall <[email protected]> wrote:
> > > >
> > > > > I am new to Lucene and have the following questions. What is the
> > > > > best way to understand what is required to install Lucene on a
> > > > > server? Also, can I make Lucene run searches on links to XML data
> > > > > on the web? Thanks
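
To make the "extract the text, push it to the indexer, then query it back out" flow from this thread concrete, here is a rough sketch in Python. It is only a toy in-memory inverted index, not the Lucene or Lucene.Net API, and `ToyIndex` and `extract_text` are hypothetical names invented for illustration; `extract_text` merely stands in for whatever Tika or an IFilter would do.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Toy in-memory inverted index. NOT the Lucene.Net API, just the idea."""

    def __init__(self):
        # term -> set of ids of the documents that contain it
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        """Push already-extracted plain text into the index."""
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def extract_text(raw):
    """Hypothetical stand-in for Tika or an IFilter: strips markup tags."""
    return re.sub(r"<[^>]+>", " ", raw)


# The indexer never fetches or parses documents itself: the caller extracts
# the text and pushes it in, then writes a query to get matches back out.
index = ToyIndex()
index.add("report.html", extract_text("<p>Quarterly <b>sales</b> report</p>"))
index.add("notes.xml", extract_text("<note>Sales meeting notes</note>"))
hits = index.search("sales report")
```

With a real library the extraction step and the index storage would be far more involved, but the division of labour is the same: extraction happens outside the indexer, and the index only ever sees plain text.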
