[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Robin Sommer (JIRA)

[ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105#comment-17105
 ] 

Robin Sommer commented on BIT-1215:
---

I haven't looked at the code yet, but if there's a hard line length
limit in there, that's a problem. bro-cut shouldn't care how long
lines are.




 bro-cut should be rewritten in C for speed and to not depend on gawk
 

 Key: BIT-1215
 URL: https://bro-tracker.atlassian.net/browse/BIT-1215
 Project: Bro Issue Tracker
 Issue Type: Improvement
 Components: Bro, bro-aux
 Reporter: Daniel Thayer
 Fix For: 2.4


 The current implementation of bro-cut is too slow when processing large log 
 files (takes more than a minute to process a single log file a few hundred MB 
 in size).  Justin Azoff rewrote bro-cut in C and found that it runs an order 
 of magnitude faster.  Another benefit of a C version of bro-cut is that we 
 will no longer depend on gawk for anything (and some of Bro's supported 
 platforms do not include gawk by default).



--
This message was sent by Atlassian JIRA
(v6.3-OD-08-005-WN#6328)
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev




Re: [Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Slagell, Adam J
We are going to make it configurable and default to something like a 1000 KB
line. Otherwise, you have to check on every line processed whether memory
needs to be reallocated, which seems inefficient just to handle edge cases.
Letting the user override the default is a good compromise though.

 On Jul 10, 2014, at 4:30 PM, Robin Sommer (JIRA) 
 j...@bro-tracker.atlassian.net wrote:
 
 I haven't looked at the code yet, but if there's a hard line length
 limit in there, that's a problem. bro-cut shouldn't care how long
 lines are.



[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Justin Azoff (JIRA)

[ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107#comment-17107
 ] 

Justin Azoff commented on BIT-1215:
---

I think starting with 1M and growing 2x with realloc as needed is the way to
go after all.  We need (and already have) the check to see if fgets truncated
the line.

I think the only other thing to do would be to add an absolute max line length
of 64M or something to handle the case where someone accidentally runs bro-cut
against a binary file (like a compressed bro log) that just doesn't contain
any newlines.
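
A minimal sketch of the approach described above, not the actual bro-cut
code: read with fgets, detect truncation by a missing trailing newline, and
double the buffer until the line fits or a ceiling is reached. The names
read_line, INITIAL_CAP and MAX_CAP are hypothetical, and the 1M and 64M
values are just the figures floated in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_CAP (1024UL * 1024UL)        /* 1M starting buffer (assumed default) */
#define MAX_CAP     (64UL * 1024UL * 1024UL) /* 64M absolute ceiling (assumed) */

/* Hypothetical helper. Reads one line from fp into *buf, growing the buffer
 * 2x each time fgets() fills it without reaching a newline.
 * If *buf is NULL it is allocated with *cap bytes (INITIAL_CAP if *cap < 2).
 * Returns the line length with the newline stripped, or -1 on EOF, on
 * allocation failure, or when the line would exceed MAX_CAP.
 * Limitation: relies on strlen(), so embedded NUL bytes shorten the line. */
long read_line(FILE *fp, char **buf, size_t *cap)
{
    size_t len = 0;

    assert(fp != NULL && buf != NULL && cap != NULL);

    if (*buf == NULL) {
        if (*cap < 2)
            *cap = INITIAL_CAP;
        *buf = malloc(*cap);
        if (*buf == NULL)
            return -1;
    }
    (*buf)[0] = '\0';

    while (fgets(*buf + len, (int)(*cap - len), fp) != NULL) {
        len += strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';      /* complete line: strip the newline */
            return (long)len;
        }
        if (len + 1 < *cap)            /* buffer not full: last line lacks a newline */
            return (long)len;
        if (*cap >= MAX_CAP)           /* probably binary input without newlines */
            return -1;
        char *grown = realloc(*buf, *cap * 2);  /* fgets truncated: grow 2x */
        if (grown == NULL)
            return -1;
        *buf = grown;
        *cap *= 2;
    }
    return len > 0 ? (long)len : -1;
}
```

The caller keeps buf and cap across calls, so the buffer only grows when a
line longer than anything seen so far turns up, and the per-line cost in the
common case is a single comparison after fgets returns.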





[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Robin Sommer (JIRA)

[ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108#comment-17108
 ] 

Robin Sommer commented on BIT-1215:
---






Yes. Maybe a bit less than 2x; exponential growth adds up quickly. :)


Would be nicer to recognize a binary input file differently, like by
not finding a log header; that way we can give a good error message.
If such a check is in place, I wouldn't actually bother with another
double-check on line length; in the unlikely case that the file has a
correct header but totally broken content, I'm sure there are plenty of
other cases where bro-cut would fail, and it seems the worst that can
happen here is running out of memory (which the OS will catch).
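
As a rough illustration of the header check suggested above (a sketch, not
bro-cut's actual logic): Bro's ASCII logs begin with a "#separator" line, so
testing the first line for that prefix is enough to reject compressed or
otherwise binary input with a clear message. The function name
looks_like_bro_log is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper. Returns 1 if first_line starts with "#separator",
 * the first header line of a Bro ASCII log, and 0 otherwise. */
int looks_like_bro_log(const char *first_line)
{
    static const char prefix[] = "#separator";

    assert(first_line != NULL);
    return strncmp(first_line, prefix, sizeof(prefix) - 1) == 0;
}
```

A caller could run this on the first line it reads and, on failure, print
something like "input does not look like a Bro log" and exit before any
large buffer growth happens.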




[Bro-Dev] [JIRA] (BIT-1217) Documentation: include type for vectors

2014-07-10 Thread Johanna Amann (JIRA)
Johanna Amann created BIT-1217:
--

 Summary: Documentation: include type for vectors
 Key: BIT-1217
 URL: https://bro-tracker.atlassian.net/browse/BIT-1217
 Project: Bro Issue Tracker
 Issue Type: Problem
 Components: Bro, Website
 Affects Versions: git/master
 Reporter: Johanna Amann
 Fix For: 2.4


While browsing our documentation, I noticed that the script reference
currently does not show the type that is stored inside a vector.

Having that would be very convenient. At the moment it is, for example,
impossible to find out what kind of data a vector in an Info record contains.
See
http://www.bro.org/sphinx-git/scripts/base/protocols/ssl/main.bro.html#type-SSL::Info
for an example.





Re: [Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Robin Sommer



On Thu, Jul 10, 2014 at 17:27 -0500, you wrote:

 I think start with 1M and realloc 2x as needed is the way to go after
 all.

Yes. Maybe a bit less than 2x; exponential growth adds up quickly. :)

 I think the only thing to do would be to add an absolute max line
 length of 64M or something to handle the case where someone
 accidentally runs bro-cut against a binary file (like a compressed bro
 log) that just doesn't contain any newlines.

Would be nicer to recognize that differently, like by not finding a
log header; that way we can give a good error message. If such a check
is in place, I wouldn't actually bother with another double-check on
line length; in the unlikely case that the file has a correct header
but totally broken content, I'm sure there are plenty of other cases
where bro-cut would fail, and it seems the worst that can happen here
is running out of memory (which the OS will catch).


[Bro-Dev] [JIRA] (BIT-1217) Documentation: include type for vectors

2014-07-10 Thread Jon Siwek (JIRA)

 [ 
https://bro-tracker.atlassian.net/browse/BIT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Siwek updated BIT-1217:
---
Resolution: Fixed
Status: Closed  (was: Open)
