Re: sed filter module

2007-03-14 Thread Frank

Just wanted to add my two cents worth...

We are using mod_line_edit a lot and would like to see a similar 
functionality coming with Apache by default. :-)


If I am correct, mod_line_edit has the 'wrong' license model for being
included in Apache by default.


Just for your information: there are more modules with similar
functionality:


http://mod-replace.sourceforge.net/
http://yomi.2288.org/forum/ftopic22.html (given by 
http://modules.apache.org/search?id=857)

http://happygiraffe.net/mod_sed.html (VERY old)


All modules are missing a feature we would like to see: Like in 
mod_rewrite's RewriteMap it would be cool to specify a function being 
called on the argument while replacing. E.g.:


RewriteBodyLine 'http://(.*?)/(.*)/(.*)' 
'http://${LOWERCASE:$1}/${MD5:$2}/$3'


... as I said before: just my $.02

P.S.: And I vote for a better name like 'mod_filter_pcre' ...
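
A minimal sketch of the ${NAME:$N} idea from the RewriteBodyLine example, assuming POSIX regex.h in place of PCRE (so no lazy .*? quantifier) and a hypothetical table of per-character functions. Names such as map_table, expand and rewrite_line are invented for illustration and belong to none of the modules mentioned. MD5 is left out because it changes the length of the text and would need a different callback signature.

```c
#include <assert.h>
#include <ctype.h>
#include <regex.h>   /* POSIX regcomp/regexec, standing in for PCRE */
#include <stddef.h>
#include <string.h>

/* Hypothetical per-character map functions, looked up by name. */
typedef int (*char_map)(int);

static const struct { const char *name; char_map fn; } map_table[] = {
    { "LOWERCASE", tolower },
    { "UPPERCASE", toupper },
};

static char_map find_map(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof map_table / sizeof map_table[0]; i++)
        if (strlen(map_table[i].name) == len &&
            strncmp(map_table[i].name, name, len) == 0)
            return map_table[i].fn;
    return NULL;
}

/* Expand tmpl into out. Supports literals, $N and ${NAME:$N}, N in 0..9.
 * Returns 0 on success, -1 on a bad template, unknown NAME or overflow. */
static int expand(const char *tmpl, const char *subject,
                  const regmatch_t *m, char *out, size_t cap)
{
    size_t used = 0;
    if (cap == 0) return -1;
    while (*tmpl != '\0') {
        char_map fn = NULL;
        const char *p = tmpl;
        size_t n, len;
        if (*p != '$') {                       /* literal character */
            if (used + 1 >= cap) return -1;
            out[used++] = *tmpl++;
            continue;
        }
        p++;
        if (*p == '{') {                       /* ${NAME:$N} form */
            const char *colon = strchr(p, ':');
            if (colon == NULL) return -1;
            fn = find_map(p + 1, (size_t)(colon - (p + 1)));
            if (fn == NULL || colon[1] != '$') return -1;
            p = colon + 2;
        }
        if (!isdigit((unsigned char)*p)) return -1;
        n = (size_t)(*p++ - '0');
        if (fn != NULL && *p++ != '}') return -1;
        if (m[n].rm_so < 0) return -1;         /* group did not participate */
        len = (size_t)(m[n].rm_eo - m[n].rm_so);
        if (used + len >= cap) return -1;
        for (size_t i = 0; i < len; i++) {
            unsigned char c = (unsigned char)subject[m[n].rm_so + i];
            out[used + i] = (char)(fn != NULL ? fn(c) : c);
        }
        used += len;
        tmpl = p;
    }
    out[used] = '\0';
    return 0;
}

/* Rewrite the first match of pattern (POSIX ERE) in line.
 * Returns 0 on success, 1 if nothing matched, -1 on error. */
int rewrite_line(const char *pattern, const char *tmpl,
                 const char *line, char *out, size_t cap)
{
    regex_t re;
    regmatch_t m[10];
    size_t pre, used, post;
    int rc;

    assert(pattern && tmpl && line && out);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;
    rc = regexec(&re, line, 10, m, 0);
    regfree(&re);
    if (rc == REG_NOMATCH) return 1;
    if (rc != 0) return -1;

    pre = (size_t)m[0].rm_so;
    if (pre >= cap) return -1;
    memcpy(out, line, pre);                    /* text before the match */
    if (expand(tmpl, line, m, out + pre, cap - pre) != 0) return -1;
    used = pre + strlen(out + pre);
    post = strlen(line + m[0].rm_eo);
    if (used + post >= cap) return -1;
    memcpy(out + used, line + m[0].rm_eo, post + 1);  /* rest, with NUL */
    return 0;
}
```

The pattern is compiled on every call only to keep the sketch short; a real filter would presumably compile each rule once at configuration time.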


Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 10:07:49 +0100
Frank [EMAIL PROTECTED] wrote:

 Just wanted to add my two cents worth...
 
 We are using mod_line_edit a lot and would like to see a similar 
 functionality coming with Apache by default. :-)

Sounds like a vote.

 If I am correct, mod_line_edit has the 'wrong' license model for
 being included in Apache by default.

Indeed.  When my modules have been integrated into the standard
distribution in the past, they've moved to the Apache license.
It's not a problem when there's a good reason for it.

 Just for your information: there are more modules with similar
 functionality:

Interesting!
 
 http://mod-replace.sourceforge.net/

That one's genuinely interesting.  Looks like an alternative
reverse-proxy solution, combining filtering with the mod_proxy cookie
rewriting that was missing in 2.0.  But it buffers an entire response
in memory, which limits its usefulness.

 http://yomi.2288.org/forum/ftopic22.html (given by 
 http://modules.apache.org/search?id=857)

My Chinese isn't up to finding a download link there!

 http://happygiraffe.net/mod_sed.html (VERY old)

No thank you:-)

 All modules are missing a feature we would like to see: Like in 
 mod_rewrite's RewriteMap it would be cool to specify a function being 
 called on the argument while replacing. E.g.:
 
 RewriteBodyLine 'http://(.*?)/(.*)/(.*)' 
 'http://${LOWERCASE:$1}/${MD5:$2}/$3'

This kind of feature is on the to-do list, amongst some
hacks-in-progress that have yet to reach the mod_line_edit site.
This is actually what alarms me somewhat about the prospect of
a different but near-identical module in /trunk/: it leaves me 
either abandoning or redoing some of this stuff.

 P.S.: And I vote for a better name like 'mod_filter_pcre' ...

But it isn't.  It offers string as well as regex matching!

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/


Re: sed filter module

2007-03-14 Thread Jim Jagielski


On Mar 14, 2007, at 5:07 AM, Frank wrote:



RewriteBodyLine 'http://(.*?)/(.*)/(.*)' 'http://${LOWERCASE:$1}/${MD5:$2}/$3'




Yeah, that would be useful... Of course, the main issue is
that whereas mod_rewrite can afford to be dog slow, because,
after all, the URLs aren't *that* big, in-place rewriting
of content can't be. The more complex the functionality,
the slower it will be... :/


Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 09:25:11 -0400
Jim Jagielski [EMAIL PROTECTED] wrote:

 
 On Mar 14, 2007, at 5:07 AM, Frank wrote:
 
 
  RewriteBodyLine 'http://(.*?)/(.*)/(.*)' 'http://${LOWERCASE:$1}/${MD5:$2}/$3'
 
 
 Yeah, that would be useful... Of course, the main issue is
 that whereas mod_rewrite can afford to be dog slow, because,
 after all, the URLs aren't *that* big, in-place rewriting
 of content can't be. The more complex the functionality,
 the slower it will be... :/

Solved in mod_line_edit: the code path for extra functionality
(such as per-rule conditional execution and environment variable
substitution) is invoked only when required.

As for the particular case Frank asked for, that works by
expanding the union to include a function pointer alongside
the strmatch and regexp cases.  So it's also a per-rule
configuration flag, and never touches the code path except
where explicitly invoked.

-- 
Nick Kew



Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 13:45:47 +
Nick Kew [EMAIL PROTECTED] wrote:


 As for the particular case Frank asked for, that works by
 expanding the union to include a function pointer alongside
 the strmatch and regexp cases.  So it's also a per-rule
 configuration flag, and never touches the code path except
 where explicitly invoked.

Sorry, I meant the 'to' field becomes a union which may
be a function.
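
A minimal sketch of that layout, assuming a hypothetical rule struct whose 'to' member is a union of a fixed string and a function pointer, selected by a per-rule flag. The names (rule, RULE_TO_FUNC, replace_fn, rule_apply) are invented here and are not mod_line_edit's actual definitions. Only literal matching is shown, to keep it short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define RULE_TO_FUNC 0x1u   /* hypothetical flag: to.fn is valid, not to.str */

/* Callback computing a replacement for the matched text.
 * Returns bytes written to out, or (size_t)-1 if out is too small. */
typedef size_t (*replace_fn)(const char *match, size_t len,
                             char *out, size_t cap);

typedef struct {
    unsigned    flags;
    const char *from;            /* literal text to search for */
    union {
        const char *str;         /* fixed replacement (the common case) */
        replace_fn  fn;          /* computed replacement (RULE_TO_FUNC) */
    } to;
} rule;

/* Example callback: upper-case whatever was matched. */
size_t shout(const char *match, size_t len, char *out, size_t cap)
{
    if (len > cap) return (size_t)-1;
    for (size_t i = 0; i < len; i++)
        out[i] = (char)toupper((unsigned char)match[i]);
    return len;
}

/* Replace the first occurrence of r->from in line.
 * Returns 0 on success, 1 if nothing matched, -1 if out is too small. */
int rule_apply(const rule *r, const char *line, char *out, size_t cap)
{
    const char *hit;
    size_t flen, pre, n, post;

    assert(r != NULL && line != NULL && out != NULL);
    hit = strstr(line, r->from);
    if (hit == NULL) return 1;
    flen = strlen(r->from);
    pre = (size_t)(hit - line);
    if (pre >= cap) return -1;
    memcpy(out, line, pre);

    if (r->flags & RULE_TO_FUNC) {       /* extra path, taken only if flagged */
        n = r->to.fn(hit, flen, out + pre, cap - pre);
        if (n == (size_t)-1) return -1;
    } else {                             /* plain string: no indirect call */
        n = strlen(r->to.str);
        if (n > cap - pre) return -1;
        memcpy(out + pre, r->to.str, n);
    }

    post = strlen(hit + flen);
    if (pre + n + post >= cap) return -1;
    memcpy(out + pre + n, hit + flen, post + 1);   /* tail, including NUL */
    return 0;
}
```

Rules without the flag cost one extra bit test per match, which is the sense in which the feature never touches the common code path.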


-- 
Nick Kew



Re: sed filter module

2007-03-14 Thread Joe Orton
On Tue, Mar 13, 2007 at 09:24:25AM -0400, Jim Jagielski wrote:
 There have been times when having a simple sed filter in Apache
 would be useful... I used to use just ext_filter to do this,
 but this got more and more painful the more I used it. So awhile
 ago I made mod_sed_filter which I find pretty useful. I've just
 built and tested it with 2.2 and trunk...
 
 Anyone mind if I fold it into trunk and maybe have us
 consider making it part of 2.2 (even under experimental)?
 
 No docs yet but the code is:
 
   http://people.apache.org/~jim/code/mod_sed_filter.c

It would be good to have a simple filter like this in the tree.  From a 
quick review:

1) the filtering logic is broken and will consume RAM proportional to 
response size.  The mantra for writing output filters should be: read 
buckets, process buckets, pass buckets, repeat

2) 200-line functions are hard to read :)

...otherwise looks like nice simple code.  I don't see a *big* issue 
with the name implying likeness-of-sed.  mod_{pcre,text}_filter or 
something is as good.

Nick, are you actually planning to submit mod_line_edit for inclusion in 
the tree?

joe


Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 14:32:13 +
Joe Orton [EMAIL PROTECTED] wrote:

 1) the filtering logic is broken and will consume RAM proportional to 
 response size.

I must've missed that when I looked.  I thought it used the
same logic as mod_line_edit, which is very careful about that.

Oh, I guess you mean the copying to get a null-terminated string
when applying a regexp?  And I see it's repeated for every regexp
(ouch)!  mod_line_edit uses a local pool which is cleared at the
end of each brigade, and avoids multiple copies of the same buffer.

 2) 200-line functions are hard to read :)

mod_line_edit does the same there, but that's definitely being split
(not least so that the actual search-and-replace function can be
re-used in a companion input filter).  And given that it's unusually
well-commented and half of it features as example code in my book,
I don't think it's hard to read:-)

 Nick, are you actually planning to submit mod_line_edit for inclusion
 in the tree?

The subject hasn't arisen until this thread (which caught me rather
off-balance), but I'll be happy to include it if there's demand.

As I hinted, there are some enhancements in the pipeline.
If it goes in to trunk, a roadmap would probably be in order.

-- 
Nick Kew



Re: sed filter module

2007-03-14 Thread Jim Jagielski


On Mar 14, 2007, at 11:01 AM, Nick Kew wrote:


Oh, I guess you mean the copying to get a null-terminated string
when applying a regexp?  And I see it's repeated for every regexp
(ouch)!  mod_line_edit uses a local pool which is cleared at the
end of each brigade, and avoids multiple copies of the same buffer.



Hmmm... I'm confused. The way I do it is:

loop over sed scripts
  loop over buckets
    read bucket
    make copy of bucket data for regex comparison

so every time we read in bucket data, I have to make
a null-terminated string. It changes with each bucket.
So I don't understand the issue with it being repeated
for every regexp. How can that be avoided?

I reuse allocated space (I don't just simply keep
making strdups)... so yeah, there will be a chunk
of allocated pool memory still hanging around. So maybe
making that a subpool and then clearing/destroying
it would be best.
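
A minimal sketch of the reuse described above, assuming plain malloc/realloc instead of APR pools so that it stands alone. The scratch type and its functions are hypothetical names. In httpd, the role of scratch_free would presumably be played by clearing or destroying a subpool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reusable buffer for NUL-terminated copies of bucket data. */
typedef struct {
    char  *buf;
    size_t cap;
} scratch;

/* Return a NUL-terminated copy of data[0..len), reusing s->buf whenever it
 * is already large enough. Returns NULL if memory cannot be obtained, in
 * which case the previous buffer is left untouched. */
const char *scratch_cstr(scratch *s, const char *data, size_t len)
{
    assert(s != NULL && (data != NULL || len == 0));
    if (len + 1 > s->cap) {
        size_t ncap = s->cap ? s->cap : 64;
        char *nbuf;
        while (ncap < len + 1)
            ncap *= 2;                    /* grow geometrically */
        nbuf = realloc(s->buf, ncap);
        if (nbuf == NULL) return NULL;
        s->buf = nbuf;
        s->cap = ncap;
    }
    if (len > 0)
        memcpy(s->buf, data, len);
    s->buf[len] = '\0';                   /* regex APIs expect a C string */
    return s->buf;
}

/* Release the buffer once the filter is done with it. */
void scratch_free(scratch *s)
{
    free(s->buf);
    s->buf = NULL;
    s->cap = 0;
}
```

The buffer only ever grows to the size of the largest bucket seen, so the leftover memory is bounded by bucket size rather than by response size.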


Re: sed filter module

2007-03-14 Thread Joe Orton
On Wed, Mar 14, 2007 at 03:01:53PM +, Nick Kew wrote:
 On Wed, 14 Mar 2007 14:32:13 +
 Joe Orton [EMAIL PROTECTED] wrote:
 
  1) the filtering logic is broken and will consume RAM proportional to 
  response size.
 
 I must've missed that when I looked.  I thought it used the
 same logic as mod_line_edit, which is very careful about that.

It looks just as broken to me.  It will read() from every bucket in the 
input brigade without passing anything on, so you guarantee that the 
entire response is mapped into RAM for a single filter invocation.

joe


Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 15:27:44 +
Joe Orton [EMAIL PROTECTED] wrote:

 On Wed, Mar 14, 2007 at 03:01:53PM +, Nick Kew wrote:
  On Wed, 14 Mar 2007 14:32:13 +
  Joe Orton [EMAIL PROTECTED] wrote:
  
   1) the filtering logic is broken and will consume RAM
   proportional to response size.
  
  I must've missed that when I looked.  I thought it used the
  same logic as mod_line_edit, which is very careful about that.
 
 It looks just as broken to me.  It will read() from every bucket in
 the input brigade without passing anything on,

Yes, the processing unit is the brigade.  A bucket could easily be
just a byte or two, whereas a brigade is more likely to be a sensible
amount of the data (such as the 8K seen when mod_proxy is driving,
and which is the most common usage case).

so you guarantee that
 the entire response is mapped into RAM for a single filter invocation.

Nope.  Just one brigade's worth at a time.  And the most likely case
for that to be an entire document is when it's a static file, and
document == brigade == bucket.


-- 
Nick Kew



Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 11:15:00 -0400
Jim Jagielski [EMAIL PROTECTED] wrote:

 
 On Mar 14, 2007, at 11:01 AM, Nick Kew wrote:
 
  Oh, I guess you mean the copying to get a null-terminated string
  when applying a regexp?  And I see it's repeated for every regexp
  (ouch)!  mod_line_edit uses a local pool which is cleared at the
  end of each brigade, and avoids multiple copies of the same buffer.
 
 
 Hmmm... I'm confused. The way I do it is:
 
 loop over sed scripts
   loop over buckets
     read bucket
     make copy of bucket data for regex comparison

You're right, I was confused, and mod_line_edit does exactly the same.
What I'd like to get rid of is that copy inside the loop: once
copied, the copied bucket data should be reusable for other scripts.
But as we both found, that's harder!

-- 
Nick Kew



Re: sed filter module

2007-03-14 Thread Joe Orton
On Wed, Mar 14, 2007 at 03:45:05PM +, Nick Kew wrote:
 Nope.  Just one brigade's worth at a time.  And the most likely case
 for that to be an entire document is when it's a static file, and
 document == brigade == bucket.

I'm not sure what you're saying here.  Which do you agree with:

a) size of data represented by a brigade is limited only by apr_off_t
b) httpd does use brigades representing large amounts of content e.g. 
containing FILE or CGI/PIPE buckets
c) if you loop through all the buckets in a brigade calling read() on 
every one, you map all the data represented by the brigade into RAM
d) writing filters which use RAM proportional to content size is bad

joe


Re: sed filter module

2007-03-14 Thread Nick Kew
On Wed, 14 Mar 2007 16:56:41 +
Joe Orton [EMAIL PROTECTED] wrote:

 On Wed, Mar 14, 2007 at 03:45:05PM +, Nick Kew wrote:
  Nope.  Just one brigade's worth at a time.  And the most likely case
  for that to be an entire document is when it's a static file, and
  document == brigade == bucket.
 
 I'm not sure what you're saying here.  Which do you agree with:
 
 a) size of data represented by a brigade is limited only by apr_off_t

ditto size of a bucket

 b) httpd does use brigades representing large amounts of content e.g. 
 containing FILE or CGI/PIPE buckets

Again, the unit of indefinite size is the bucket

 c) if you loop through all the buckets in a brigade calling read() on 
 every one, you map all the data represented by the brigade into RAM

Indeed.

 d) writing filters which use RAM proportional to content size is bad

Yep.

Now, what leads you to suppose mod_line_edit uses RAM proportional
to content size?  Other than when the entire contents arrive in a
single bucket?

-- 
Nick Kew



Re: sed filter module

2007-03-14 Thread Justin Erenkrantz

On 3/14/07, Nick Kew [EMAIL PROTECTED] wrote:

to content size?  Other than when the entire contents arrive in a
single bucket?


Uh, a file bucket?  -- justin


Re: sed filter module

2007-03-14 Thread Jim Jagielski

As a rough proof of concept, I refactored the design,
allowing for the pattern matching and substitution to be
done as soon as we have a line. There is also some
rough ability to pass the data to the next filter
after we get more than ~AP_MIN_BYTES_TO_WRITE bytes.
Doesn't alleviate all the problems, but it allows
for us to pass data quicker (we still have the issue
where we need to fully read in the bb though...)
It's rough but passes superficial testing...

More work needs to be done, but more people could
work on it if I just commit to trunk :)

Same URL, different version:

http://people.apache.org/~jim/code/mod_sed_filter.c
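
A minimal sketch of the two ideas in this refactoring, assuming plain byte chunks instead of buckets: carry a partial line across chunk boundaries, handle each line as soon as its newline arrives, and pass output on once a threshold is reached. FLUSH_THRESHOLD stands in for AP_MIN_BYTES_TO_WRITE, and line_filter, pass_on and the other names are hypothetical. This is not the code at the URL above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FLUSH_THRESHOLD 8000   /* assumed stand-in for AP_MIN_BYTES_TO_WRITE */
#define MAX_LINE        1024   /* hypothetical cap on a carried-over line */
#define OUT_CAP         (FLUSH_THRESHOLD + MAX_LINE)

typedef struct {
    char   line[MAX_LINE]; size_t line_len;  /* partial line between chunks */
    char   out[OUT_CAP];   size_t out_len;   /* processed, not yet passed on */
    size_t lines_seen, flushes, bytes_passed;
} line_filter;

void filter_init(line_filter *f) { memset(f, 0, sizeof *f); }

/* Stand-in for handing data to the next filter (ap_pass_brigade in httpd). */
static void pass_on(line_filter *f)
{
    if (f->out_len == 0) return;
    f->bytes_passed += f->out_len;
    f->flushes++;
    f->out_len = 0;
}

/* A complete line is available: a real filter would run its substitutions
 * on f->line here. The sketch copies it through unchanged. */
static void emit_line(line_filter *f)
{
    memcpy(f->out + f->out_len, f->line, f->line_len);
    f->out_len += f->line_len;
    f->line_len = 0;
    f->lines_seen++;
    if (f->out_len >= FLUSH_THRESHOLD)
        pass_on(f);                /* do not wait for the end of the input */
}

/* Feed one chunk of input; chunk boundaries need not align with lines. */
void filter_feed(line_filter *f, const char *data, size_t len)
{
    assert(f != NULL && (data != NULL || len == 0));
    for (size_t i = 0; i < len; i++) {
        f->line[f->line_len++] = data[i];
        if (data[i] == '\n' || f->line_len == MAX_LINE)
            emit_line(f);          /* over-long lines are split at MAX_LINE */
    }
}

/* End of input: flush a trailing unterminated line and any pending output. */
void filter_end(line_filter *f)
{
    if (f->line_len > 0)
        emit_line(f);
    pass_on(f);
}
```

Memory held by the sketch is bounded by OUT_CAP plus MAX_LINE regardless of input size, which is the property the thread is arguing about.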



Re: sed filter module

2007-03-14 Thread Joe Orton
On Wed, Mar 14, 2007 at 06:38:48PM +, Nick Kew wrote:
 Now, what leads you to suppose mod_line_edit uses RAM proportional
 to content size?  Other than when the entire contents arrive in a
 single bucket?

Because it implements the naive filter implementation, equivalent to:

e = APR_BRIGADE_FIRST(bb);
while (e != APR_BRIGADE_SENTINEL(bb)) {
   apr_bucket_read(e, ...);
   ...process bucket without passing on to f-next or deleting...
   e = APR_BUCKET_NEXT(e);
}

for the general case given bb contains a single FILE bucket, or a 
CGI/PIPE bucket, or any morphing bucket type which doesn't represent a 
chunk of memory, this does:

After Iter#   Contents of bb         Heap memory used
1             HEAP FILE              8K
2             HEAP HEAP FILE         16K
3             HEAP HEAP HEAP FILE    24K
...
n             HEAP*n                 n*8K

where n ~= file size / 8K; FILE buckets will also morph into MMAP 
buckets so the practice is a bit more complicated but this illustrates 
the point... and the 8K is really 8000 bytes.
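
The table can be reproduced with a small simulation. This is a sketch under simplifying assumptions: a single FILE bucket, each read splitting off exactly 8000 bytes into a HEAP bucket, no MMAP morphing, and a downstream filter that consumes and frees whatever it is passed. The simulate function and usage type are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define READ_CHUNK 8000   /* bytes split off into a HEAP bucket per read */

typedef struct {
    size_t peak_heap;     /* most heap memory held at any one time */
    size_t reads;         /* number of bucket reads performed */
} usage;

/* Walk a FILE bucket of file_size bytes. With pass_each_bucket == 0 the
 * HEAP buckets pile up in the brigade (the naive loop); with a nonzero
 * value each one is passed downstream and deleted before the next read. */
usage simulate(size_t file_size, int pass_each_bucket)
{
    usage u = { 0, 0 };
    size_t remaining = file_size;
    size_t held = 0;

    while (remaining > 0) {
        size_t n = remaining < READ_CHUNK ? remaining : READ_CHUNK;
        remaining -= n;           /* read() morphs n bytes into HEAP */
        held += n;
        u.reads++;
        assert(held <= file_size);
        if (held > u.peak_heap)
            u.peak_heap = held;
        if (pass_each_bucket)
            held = 0;             /* passed on and freed downstream */
    }
    return u;
}
```

For a 1,000,000-byte file both variants perform 125 reads; the naive loop peaks at 1,000,000 bytes of heap, while passing each bucket on peaks at 8000.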

joe


Re: sed filter module

2007-03-13 Thread Nick Kew
On Tue, 13 Mar 2007 09:24:25 -0400
Jim Jagielski [EMAIL PROTECTED] wrote:


   http://people.apache.org/~jim/code/mod_sed_filter.c

At a glance, it looks like mod_line_edit.
Are you doing anything different?

-- 
Nick Kew



Re: sed filter module

2007-03-13 Thread William A. Rowe, Jr.
Jim Jagielski wrote:
 Anyone mind if I fold it into trunk and maybe have us
 consider making it part of 2.2 (even under experimental)?

+1 to trunk!  No opinion yet on 2.2 (I'm not a big fan of growing
the stable branch since it entirely defeats the drive to release
2.next, ever.)

 No docs yet but the code is:
 
 http://people.apache.org/~jim/code/mod_sed_filter.c
 
 and the usage is easy:
 
 AddOutputFilterByType SEDFILTER text/html
 Sed s/foo/bar/in
 Sed s#monkey(hat)#chimp-$1#i
 Sed s/works/functions/in
 
 note that it uses sed line controls, flexible
 delims and support regex and simple pattern match (the 'n'
 flag... no real sed option there ;) )

Is this sed or pcre syntax?  I'm a bit confused :)

Although it's sed-ish, is it misleading to confuse the user with the
phrase sed considering the unsupported constructs?  E.g. I presume
the more complex sed language features aren't present.

I'm wondering if mod_pcre_filter wouldn't be more accurate?
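
For reference, the directive arguments quoted above (an s command, a freely chosen delimiter, and trailing flags) can be split apart as in the following sketch. It assumes no escaping of the delimiter inside the pattern or replacement, treats 'i' as case-insensitive and 'n' as the simple (non-regex) match described in the quoted usage, and rejects any other flag. sed_rule and sed_parse are hypothetical names, not taken from mod_sed_filter.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SED_MAX 128   /* hypothetical limit on pattern/replacement length */

typedef struct {
    char pattern[SED_MAX];
    char replacement[SED_MAX];
    int  ignore_case;   /* 'i' flag */
    int  literal;       /* 'n' flag: simple pattern match, not a regex */
} sed_rule;

/* Copy src[0..len) into dst as a C string; -1 if it does not fit. */
static int copy_field(char *dst, const char *src, size_t len)
{
    if (len >= SED_MAX) return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}

/* Parse "s<d>pattern<d>replacement<d>flags", where <d> is any single
 * delimiter character. Returns 0 on success, -1 on malformed input. */
int sed_parse(const char *arg, sed_rule *r)
{
    const char *p1, *p2, *p3, *f;
    char delim;

    assert(arg != NULL && r != NULL);
    if (arg[0] != 's' || arg[1] == '\0') return -1;
    delim = arg[1];
    p1 = arg + 2;                          /* start of pattern */
    p2 = strchr(p1, delim);                /* end of pattern */
    if (p2 == NULL) return -1;
    p3 = strchr(p2 + 1, delim);            /* end of replacement */
    if (p3 == NULL) return -1;

    if (copy_field(r->pattern, p1, (size_t)(p2 - p1)) != 0) return -1;
    if (copy_field(r->replacement, p2 + 1, (size_t)(p3 - (p2 + 1))) != 0)
        return -1;

    r->ignore_case = 0;
    r->literal = 0;
    for (f = p3 + 1; *f != '\0'; f++) {
        if (*f == 'i') r->ignore_case = 1;
        else if (*f == 'n') r->literal = 1;
        else return -1;                    /* unknown flag */
    }
    return 0;
}
```

Nothing beyond the s command is accepted, which is the gap between this syntax and the full sed language that the naming question turns on.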


Re: sed filter module

2007-03-13 Thread Jim Jagielski


On Mar 13, 2007, at 1:10 PM, William A. Rowe, Jr. wrote:



Is this sed or pcre syntax?  I'm a bit confused :)



It's a mutant ;) But, of course, we maintain
that confusion internally with regex's being pcre...


Although it's sed-ish, is it misleading to confuse the user with the
phrase sed considering the unsupported constructs?  E.g. I presume
the more complex sed language features aren't present.

I'm wondering if mod_pcre_filter wouldn't be more accurate?



'sed' certainly gets the message across though :)
But basically it allows for regex pattern matching
and substitution in a very sed-like way.

But agreed that docs would help this


Re: sed filter module

2007-03-13 Thread Nick Kew
On Tue, 13 Mar 2007 13:34:07 -0400
Jim Jagielski [EMAIL PROTECTED] wrote:

 
 On Mar 13, 2007, at 1:10 PM, William A. Rowe, Jr. wrote:
 
 
  Is this sed or pcre syntax?  I'm a bit confused :)
 
 
 It's a mutant ;) But, of course, we maintain
 that confusion internally with regex's being pcre...
 
  Although it's sed-ish, is it misleading to confuse the user with the
  phrase sed considering the unsupported constructs?  E.g. I presume
  the more complex sed language features aren't present.
 
  I'm wondering if mod_pcre_filter wouldn't be more accurate?
 
 
 'sed' certainly gets the message across though :)
 But basically it allows for regex pattern matching
 and substitution in a very sed-like way.
 
 But agreed that docs would help this

AFAICS, this not merely looks like mod_line_edit: the filter *is*
mod_line_edit, right down to the bucket manipulation logic used as
an example in The Book!  It's just missing a couple of minor features,
and has a slightly different configuration syntax.  The other difference
is 15 months out there in widespread use.

I'm even more confused now, because I thought you were with Covalent,
and I understood from Will that mod_line_edit was widely used by
clients of Covalent.  Please tell me what I'm missing?

-- 
Nick Kew



Re: sed filter module

2007-03-13 Thread William A. Rowe, Jr.
Nick Kew wrote:
 
 I'm even more confused now, because I thought you were with Covalent,
 and I understood from Will that mod_line_edit was widely used by
 clients of Covalent.  Please tell me what I'm missing?

Just to ensure I'm not misquoted, I know I've suggested mod_line_edit
to a few Covalent clients whose desired manipulations would be best served
by a raw text manipulation program (e.g. no html/xml aware transforms).
I'm not clear if they adopted it (I haven't gotten follow up questions)
but I had passed on a quiet inquiry to you if you would be available for
consulting or support if users encountered issues, on Covalent's nickel,
of course, as anything we 'endorse' we back up in our support contracts.

Personally can't speak to any of your other questions or concerns, since
I just became aware of this module when you did.  But I'm sure Jim will
respond and satisfy your concerns.

Bill



Re: sed filter module

2007-03-13 Thread Jim Jagielski


On Mar 13, 2007, at 2:08 PM, Nick Kew wrote:



AFAICS, this not merely looks like mod_line_edit: the filter *is*
mod_line_edit, right down to the bucket manipulation logic used as
an example in The Book!  It's just missing a couple of minor features,
and has a slightly different configuration syntax.  The other difference
is 15 months out there in widespread use.



What logic? Let me know what sections you mean because
most of what I based it on is stuff from mod_include
and mod_proxy_ftp.c (and other ASF modules). I don't see
anything in either module which is new or not done by
any other modules out there that need to split out sections
from buckets.

Bill told me about mod_line_edit maybe 3-4 days ago.
I had known about mod_proxy_html, which is also something
we've pointed clients to, so maybe that's where
the confusion comes from.



Re: sed filter module

2007-03-13 Thread William A. Rowe, Jr.
Jim Jagielski wrote:
 
 Bill told me about mod_line_edit maybe 3-4 days ago.
 I had known about mod_proxy_html, which is also something
 we've pointed clients to, so maybe that's where
 the confusion comes from.

Good point - in my experience mod_proxy_html is much more broadly
adopted both by our customers, and by others I chat with at users@,
because it appears (to them) to be the obvious solution to their problem.

Most don't even realize that mod_line_edit can accomplish the same
(and perhaps more efficiently) in many cases :)

Bill


Re: sed filter module

2007-03-13 Thread William A. Rowe, Jr.
Jim Jagielski wrote:
 
 On Mar 13, 2007, at 1:10 PM, William A. Rowe, Jr. wrote:
 

 Is this sed or pcre syntax?  I'm a bit confused :)
 
 It's a mutant ;) But, of course, we maintain
 that confusion internally with regex's being pcre...

Of course :)  But it appears to be a tiny fraction of the sed language...

 Although it's sed-ish, is it misleading to confuse the user with the
 phrase sed considering the unsupported constructs?  E.g. I presume
 the more complex sed language features aren't present.

 I'm wondering if mod_pcre_filter wouldn't be more accurate?
 
 'sed' certainly gets the message across though :)
 But basically it allows for regex pattern matching
 and substitution in a very sed-like way.

since it is only a pattern substitution subset, I'd prefer to see some
RewriteBody directive or similar.  As I'm looking at the module, I'm more
convinced that Sed foo should be reserved for at least a basic sed
implementation that implemented (at least!) the pre-GNU language subset.

Bill


Re: sed filter module

2007-03-13 Thread Jim Jagielski


On Mar 13, 2007, at 3:34 PM, William A. Rowe, Jr. wrote:


Jim Jagielski wrote:


On Mar 13, 2007, at 1:10 PM, William A. Rowe, Jr. wrote:



Is this sed or pcre syntax?  I'm a bit confused :)


It's a mutant ;) But, of course, we maintain
that confusion internally with regex's being pcre...


Of course :)  But it appears to be a tiny fraction of the sed language...



Although it's sed-ish, is it misleading to confuse the user with the
phrase sed considering the unsupported constructs?  E.g. I presume
the more complex sed language features aren't present.

I'm wondering if mod_pcre_filter wouldn't be more accurate?


'sed' certainly gets the message across though :)
But basically it allows for regex pattern matching
and substitution in a very sed-like way.


since it is only a pattern substitution subset, I'd prefer to see some
RewriteBody directive or similar.  As I'm looking at the module, I'm more
convinced that Sed foo should be reserved for at least a basic sed
implementation that implemented (at least!) the pre-GNU language subset.




:)

Well, like I said, the main issue was avoiding the overhead of
having mod_ext_filter do simple in-line replacements by calling
sed to do 's/foo/bar/'... So yeah, it's closer to what a Perl
guy would think than a Unix sed-head :)