At 09:45 AM 6/2/04 -0700, Bill Moseley wrote:
> I don't think either solution is particularly difficult to implement,
> but scanning the content files directly also lets us have an easier
> time analyzing the structure of the document.

All the server does is supply the content.  Analyzing the content
happens after that, regardless of whether it comes from the server or
the file system.  Spidering lets you index the content as people see it
in their browser.

Take a look at the system being used: http://www.perlmonks.org/index.pl?node_id=357248, particularly the 'Documents' subsection.


Essentially, the actual content is stored in a POD-like format. The '=NAME' part specifies the TMPL_VAR or TMPL_IF that the given section should be placed into. By requiring a section such as '=content' in every content file to hold the regular page content (this would be defined as part of our coding standards), we can have the indexer focus only on that section. There is no need to strip away headers, footers, and other material that shouldn't be searched, because that text is never in the content file in the first place. With spidering, I either get all of the content or none of it, which means more processing to filter it back out.
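
For illustration only (the section names besides '=content' are made up here), a content file in that POD-like format might look something like:

    =title
    Widget Department

    =content
    <p>Welcome to the widget department.  All of the searchable body
    text for this page lives under this section.</p>

and an indexer that only cares about the body text could pull out just that section along these lines (a rough sketch, not the actual indexer code):

    # slurp the content file and keep only the '=content' section
    open my $fh, '<', $file or die "can't read $file: $!";
    my $text = do { local $/; <$fh> };
    my ($body) = $text =~ /^=content\s*\n(.*?)(?=^=\w+|\z)/ms;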

Now, the system allows TMPL_INCLUDE tags in the content files (at the moment this is implemented by passing the output through HTML::Template a second time, so any TMPL_* tag will be processed, but that might change). Included files will occasionally need to be part of the search, but most of the time they won't, and I don't feel I can make that assumption in all cases. So I need some way of marking which includes should be searched if we ever need that functionality, while defaulting to not searching them.
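
Roughly, the two-pass processing works out to something like this (the template and parameter names are just placeholders for the sketch):

    use HTML::Template;

    # first pass: drop the '=content' section into the site layout
    my $layout = HTML::Template->new(filename => 'layout.tmpl');
    $layout->param(content => $body);
    my $filled = $layout->output;

    # second pass: treat the filled page as a template itself, so any
    # TMPL_INCLUDE (or other TMPL_*) tags that came from the content
    # file get expanded here
    my $page = HTML::Template->new(scalarref => \$filled);
    print $page->output;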


Spidering isn't as expensive as people tend to think.  If you are
running dynamic content as plain CGI, then that's where your cost is.
It sounds like you have mostly static content, so that shouldn't be an
issue.

With this system, all "HTML" files are being run through an Apache mod_perl handler.
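
A stripped-down version of such a handler might look roughly like this (mod_perl 1.x style; the package name, template name, and parse_content_file() helper are placeholders, not the real code):

    package My::ContentHandler;
    use strict;
    use Apache::Constants qw(OK DECLINED);
    use HTML::Template;

    sub handler {
        my $r = shift;
        my $file = $r->filename;        # the requested "HTML" file on disk
        return DECLINED unless -f $file;

        # split the POD-like content file into section-name => text pairs
        my %sections = parse_content_file($file);

        my $tmpl = HTML::Template->new(
            filename          => 'layout.tmpl',
            die_on_bad_params => 0,     # ignore sections the layout doesn't use
        );
        $tmpl->param(%sections);

        $r->content_type('text/html');
        $r->send_http_header;
        $r->print($tmpl->output);
        return OK;
    }

    1;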



If you can avoid spidering (say your content is in a database) then,
yes, I'd always just index the content.

> In the case I'm suggesting using, the content files would be
> re-processed by HTML::Template for any TMPL_INCLUDE tags (however, not
> much has been created under this system besides some examples, so now
> is a good time to make incompatible changes if need be).

  <TMPL_INCLUDE NAME="file.txt" SEARCHABLE="no">

So the point is to have some other program, not based on HTML::Template,
parse that?  In that case, why not just look for:

   <!-- don't index this section -->

And then use HTML::Template to generate the output for indexing.
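
Concretely, the indexing script could run the same templates and then strip the marked sections before feeding the page to the indexer, something along these lines (the closing comment marker and the index_page() call are assumed for the sketch):

    use HTML::Template;

    my $tmpl = HTML::Template->new(filename => 'layout.tmpl',
                                   die_on_bad_params => 0);
    $tmpl->param(%sections);   # the same data the live handler would use
    my $html = $tmpl->output;

    # drop anything between the "don't index" comments; this assumes a
    # matching closing comment, which hasn't been specified above
    $html =~ s{<!--\s*don't index this section\s*-->.*?<!--\s*end don't index\s*-->}{}gs;

    index_page($html);         # hand the cleaned page to whatever indexer is in use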


-- Bill Moseley [EMAIL PROTECTED]



