Re: worddoucments search

Ryan Ackley Tue, 24 Aug 2004 06:10:20 -0700

Code example for textmining.org library:

FileInputStream in = new FileInputStream ("test.doc");
WordExtractor extractor = new WordExtractor();


String str = extractor.extractText();


----- Original Message ----- 
From: "Natarajan.T" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, August 24, 2004 8:11 AM
Subject: RE: worddoucments search


> Hi Santhosh,
> 
> Try out the below attached code.....(POI.jar should be in your class
> path)
> 
> 
> public String getContent(InputStream reader) throws IOException {
>     ArrayList text = new ArrayList();
>     POIFSFileSystem fsys = new POIFSFileSystem(reader);
> 
>     DocumentEntry headerProps =
> (DocumentEntry)fsys.getRoot().getEntry("WordDocument");
>     DocumentInputStream din =
> fsys.createDocumentInputStream("WordDocument");
>     byte[] header = new byte[headerProps.getSize()];
> 
>     din.read(header);
>     din.close();
> 
>     //Get the information we need from the header
>     int info = LittleEndian.getShort(header, 0xa);
>     boolean useTable1 = (info & 0x200) != 0;
> 
>     //get the location of the piece table
>     int complexOffset = LittleEndian.getInt(header,
> 0x1a2);
> 
>     String tableName = null;
>     if (useTable1) {
>       tableName = "1Table";
>     }
>     else{
>       tableName = "0Table";
>     }
> 
>     DocumentEntry table =
> (DocumentEntry)fsys.getRoot().getEntry(tableName);
>     byte[] tableStream = new byte[table.getSize()];
>     din = fsys.createDocumentInputStream(tableName);
>     din.read(tableStream);
> din.close();
> 
>     din = null;
>     fsys = null;
>     table = null;
>     headerProps = null;
> 
>     int multiple = findText(tableStream, complexOffset,
> text);
> 
>     StringBuffer sb = new StringBuffer();
>     int size = text.size();
>     tableStream = null;
> 
> WordTextPiece nextPiece = null;
> int start ;
> int length;
> String toStr = "";
> for (int x = 0; x < size; x++) {
> nextPiece = (WordTextPiece)text.get(x);
> start = nextPiece.getStart();
> length = nextPiece.getLength();
> 
> boolean unicode =
> nextPiece.usesUnicode();
> if (unicode) {
> toStr = new String(header,
> start, length * multiple, "UTF-16LE"); 
> }
> else{ 
> toStr = new String(header,
> start, length , "ISO-8859-1"); 
> } 
> 
> }
> 
> reader.close();
> return toStr;
> }
> 
> 
> Regards,
> Natarajan.
> 
> 
> 
> -----Original Message-----
> From: Santosh [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, August 24, 2004 5:46 PM
> To: Lucene Users List
> Subject: worddoucments search
> 
> Can lucene be able to search word documents? if so please give me
> information about it
> 
> regards
> Santosh kumar
> 
> 
> -----------------------SOFTPRO DISCLAIMER------------------------------
> 
> Information contained in this E-MAIL and any attachments are
> confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
> and 'confidential'.
> 
> If you are not an intended or authorised recipient of this E-MAIL or
> have received it in error, You are notified that any use, copying or
> dissemination  of the information contained in this E-MAIL in any
> manner whatsoever is strictly prohibited. Please delete it immediately
> and notify the sender by E-MAIL.
> 
> In such a case reading, reproducing, printing or further dissemination
> of this E-MAIL is strictly prohibited and may be unlawful.
> 
> SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
> hereto is free from computer viruses or other defects. 
> 
> The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
> those of the author and are not necessarily those of SOFTPRO SYSTEMS.
> ------------------------------------------------------------------------
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: worddoucments search

Reply via email to