Author: Alexander Barkov
> I'm trying to use mnogosearch as simple parser because it is much
> better than other scripts that were created specially for data
> extraction and analysis in my opinion. Is it possible to store full
> html code in database using "Section"? I have tried but it always strip
> html tags. CachedCopy looks encrypted.
It's not encrypted: the content is compressed (using "deflate") and
then wrapped into base64. So to get the full HTML code, you can
base64-decode it and then run zlib's inflate on the result. This needs
some programming. A simple PHP program should do the trick.
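The two steps can be sketched as follows. This is a minimal illustration written in Python rather than PHP, not code from mnoGoSearch itself. The function name decode_cached_copy is hypothetical, and since it is an assumption here whether the stored stream carries a zlib header or is raw deflate, the sketch tries both.

```python
import base64
import zlib


def decode_cached_copy(encoded: str) -> bytes:
    """Base64-decode a CachedCopy value, then inflate it.

    Hypothetical helper: the exact deflate framing used by the
    indexer is an assumption, so both variants are attempted.
    """
    # Step 1: undo the base64 wrapping (stray newlines are ignored).
    compressed = base64.b64decode(encoded)
    try:
        # Step 2a: inflate a stream that has a zlib header and checksum.
        return zlib.decompress(compressed)
    except zlib.error:
        # Step 2b: fall back to a raw deflate stream without a header.
        return zlib.decompress(compressed, -zlib.MAX_WBITS)


if __name__ == "__main__":
    # Round-trip a made-up page to show the intended usage.
    html = b"<html><body><h1>Hello</h1></body></html>"
    sample = base64.b64encode(zlib.compress(html)).decode("ascii")
    print(decode_cached_copy(sample).decode("utf-8"))
```

The string passed in would be the CachedCopy value read from the database; how it is fetched depends on your SQL backend and is left out here.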
Alternatively, you can extract cached copies using search.cgi,
> I want to save full pages and
> than explore dump with prepared parser to extract structured data.
> If such thing is not possible with "Section" by default what source
> code files I must explore? Any simple hack is possible?
Storing the original HTML code is possible in version 3.4.
You can download a pre-release of 3.4.0 from here:
3.4 stores cached copies differently (compared to 3.3):
- in a new table "urlinfob", separately from the "Section" values.
- without base64 encoding (in a "BLOB" instead of "TEXT" column)
- compressed by default using deflate,
but with an option to switch compression off.
To store cached copies uncompressed, add this command
Note that the table name "urlinfob" will probably change to "cachedcopy"
in the final 3.4.0 release.
The 3.4 manual is already online.
These pages might be of interest to you: