On Fri, Apr 9, 2010 at 12:14 PM, Keisial <[email protected]> wrote:
> Voytek Eymont wrote:
>> Micah,
>>
>> thanks !!!!!!
>> I'm logging in OK.
>>
>> on next step I do like:
>>
>> wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt
>> --keep-session-cookies
>> http://www.domain.tld/main.htm?_template=advanced&_module=active_list
>>
>> that fails until I put "" around the http string like so:
>>
>> wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt
>> --keep-session-cookies
>> "http://www.domain.tld/main.htm?_template=advanced&_module=active_list"
>>
>> or should I use some '%' characters ? for & ? or just " " around https
>> string ?
>>
>
> Just surround it with double " " or single ' ' quotes.
> If & is not quoted your shell thinks you want to execute a program called
> wget and then assign active_list to a shell variable called _module (if
> there wasn't a = it would try to run a program called _module, which would
> give you an error message you could notice)
>
>> next question: the resulting file has lots and lots of bumpf like
>> space.gif galore, etc,
>>
>> how do I make into text as much as possible, is there a wget function, or ?
>>
> Remove anything between < and >, then unescape the entities. That should
> give you quite clean text with a minimal effort.

Use grep and sed. Grep and sed are your friends:
http://www.tech-recipes.com/rx/330/remove-html-tags-from-a-file/

FC
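
For what it's worth, here is a minimal sketch of the approach described above (remove anything between < and >, then unescape the entities). It assumes each tag opens and closes on the same line and that only a few common entities matter, so it is not a real HTML parser. The function name html_to_text and the file names page.html and page.txt are made up for illustration.

```sh
#!/bin/sh
# Sketch only: strip single-line tags, decode a handful of entities,
# then drop the lines that end up empty (e.g. lines holding only an <img>).
html_to_text() {
    # &amp; is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' \
    | grep -v '^[[:space:]]*$'
}

# Hypothetical usage, where page.html is whatever file wget saved:
# html_to_text < page.html > page.txt
```

Tags that span several lines, comments, and script or style blocks will leak through this; a text-mode browser's dump option may do a better job on messy pages.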
