Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I'm trying to use mnogosearch as a simple parser because, in my
> opinion, it is much better than other scripts that were created
> specially for data extraction and analysis. Is it possible to store the
> full HTML code in the database using "Section"? I have tried, but it
> always strips HTML tags. CachedCopy looks encrypted.

It's compressed content (using "deflate"), then wrapped in base64.
So to get the full HTML code, you can base64-decode it and then run
zlib's inflate on the result. This needs some programming; a simple
PHP program should do the trick.
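
The two steps above can be sketched roughly as follows. This is only a
minimal illustration in Python rather than PHP. The function name
decode_cached_copy is hypothetical, and it assumes you have already
fetched the CachedCopy value from the database yourself. Whether the
stored data carries a zlib header or is a raw deflate stream is an
assumption here, so the sketch tries both.

```python
import base64
import zlib


def decode_cached_copy(encoded):
    """Turn a base64-wrapped, deflate-compressed value back into page bytes.

    `encoded` is the CachedCopy value as str or bytes (hypothetical input;
    how you read it from the database is up to you).
    """
    compressed = base64.b64decode(encoded)
    try:
        # wbits=47 lets zlib auto-detect a zlib or gzip header.
        return zlib.decompress(compressed, 47)
    except zlib.error:
        # Assumption: fall back to a raw deflate stream without any header.
        return zlib.decompress(compressed, -zlib.MAX_WBITS)
```

The result is a bytes object; decoding it to text depends on the
character set of the original page.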

Alternatively, you can extract cached copies using search.cgi,
like this:
./search.cgi "&cc=1&URL=http://www.site.com/test.html";


> I want to save full pages and
> then explore the dump with a prepared parser to extract structured data.
> 
> If this is not possible with "Section" by default, which source
> code files should I explore? Is any simple hack possible?

Storing the original HTML code is possible in version 3.4.
You can download a pre-release of 3.4.0 from here:
http://www.mnogosearch.org/Download/mnogosearch-3.4.0.tar.gz

3.4 stores cached copies differently (compared to 3.3):
- in a new table "urlinfob", separately from the "Section" values,
- without base64 encoding (in a "BLOB" instead of a "TEXT" column),
- compressed by default using deflate,
  but with an option to switch compression off.

To store cached copies uncompressed, add this command
to indexer.conf:

CachedCopyEncoding identity

Note that the table name "urlinfob" will probably change to "cachedcopy"
in the final 3.4.0 release.
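
For the 3.4 layout described above, reading a cached copy back might
look roughly like this. It is only a sketch under stated assumptions:
the function name and the `encoding` parameter are hypothetical, the
BLOB value is assumed to be already fetched from the table, and the
exact deflate framing (zlib header or raw stream) is not confirmed, so
both are tried.

```python
import zlib


def decode_cached_copy_34(blob, encoding="deflate"):
    """Return the original page bytes from a 3.4-style cached copy BLOB.

    `encoding` mirrors the indexer.conf setting: "identity" means the
    copy was stored uncompressed; "deflate" is used here as a label for
    the default compressed case (an assumption, not a confirmed value).
    """
    if encoding == "identity":
        # Stored as-is, no base64 and no compression.
        return blob
    if encoding == "deflate":
        try:
            return zlib.decompress(blob)
        except zlib.error:
            # Assumption: raw deflate stream without a zlib header.
            return zlib.decompress(blob, -zlib.MAX_WBITS)
    raise ValueError("unsupported encoding: %r" % (encoding,))
```

With "CachedCopyEncoding identity" the function simply returns the
bytes unchanged, which is why that setting is convenient when an
external parser reads the table directly.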

The 3.4 manual is already online.
These pages might be of interest to you:
http://www.mnogosearch.org/doc34/msearch-changelog.html
http://www.mnogosearch.org/doc34/msearch-cmdref-cachedcopyencoding.html


Reply: <http://www.mnogosearch.org/board/message.php?id=21608>

_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
